Bug#820119: [www.debian.org] validation errors: cannot convert character reference to number X because character not in internal character set
Hello all
Now that we are using the more modern tool onsgmls instead of nsgmls in our
"validate" script:
https://anonscm.debian.org/cgit/debwww/cron.git/tree/scripts/validate
I've returned to this bug.
The output of the validate script for the files containing "emojis" didn't
change much:
**** Errors validating
/srv/www.debian.org/www/international/l10n/po/en_GB.it.html: ***
Line 122, character 357: cannot convert character reference to number
128513 because character not in internal character set
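For reference, the number in that message can be decoded with a small shell sketch. The helper name codepoint_info is hypothetical. That the code point lies above U+FFFF is a fact; whether that is what trips onsgmls in fixed mode is only an assumption on my side, not something the documentation confirmed to me.

```shell
#!/bin/sh
# Hypothetical helper: print a decimal character reference as a Unicode
# code point and say whether it fits in 16 bits (the Basic Multilingual Plane).
codepoint_info() {
    n=$1
    printf 'U+%X' "$n"
    if [ "$n" -gt 65535 ]; then
        printf ' (outside the Basic Multilingual Plane)\n'
    else
        printf ' (inside the Basic Multilingual Plane)\n'
    fi
}

# The reference from the error message above.
codepoint_info 128513   # prints: U+1F601 (outside the Basic Multilingual Plane)
```

So the reference &#128513; is an emoji code point that needs more than 16 bits, which may or may not be relevant to the "internal character set" wording of the error.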
I was a bit surprised that we are still getting these errors, because if I run
the file through the online W3C validator https://validator.w3.org/ or even run
onsgmls manually on the machine that builds the website:
onsgmls -E0 -s /path/to/dtd /path/to/file
in both cases I don't get any error.
So I've looked at the "validate" script and played a bit with the options set
there, and I'd like to bring to your attention lines 363-376:
    # Determine whether we're dealing with HTML or XHTML and set the SP
    # environment accordingly.
    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_ENCODING'} = $charset;
        } else {
            $ENV{'SP_ENCODING'} = "ISO-8859-1";
        }
    }
    $ENV{'SP_CHARSET_FIXED'} = 1;
If I comment out this last line (thus letting onsgmls run in non-fixed mode), I
get no errors validating the file.
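To repeat the comparison outside the script, something like the following sketch could be used. compare_fixed_mode is a hypothetical helper, not part of "validate"; SP_ENCODING=utf-8 is assumed from the charset detected for the file, and I am also assuming that onsgmls exits with a non-zero status when it reports errors.

```shell
#!/bin/sh
# Hypothetical helper: validate one file twice, with and without
# SP_CHARSET_FIXED, using the same onsgmls options as the manual command.
compare_fixed_mode() {
    dtd=$1
    file=$2

    echo "== SP_CHARSET_FIXED=1 (what the script sets today) =="
    # The subshell keeps the exported variables from leaking out.
    if ( export SP_ENCODING=utf-8 SP_CHARSET_FIXED=1
         onsgmls -E0 -s "$dtd" "$file" ); then
        echo "no errors reported"
    else
        echo "errors reported"
    fi

    echo "== SP_CHARSET_FIXED unset (line 376 commented out) =="
    if ( unset SP_CHARSET_FIXED
         export SP_ENCODING=utf-8
         onsgmls -E0 -s "$dtd" "$file" ); then
        echo "no errors reported"
    else
        echo "errors reported"
    fi
}

# Usage (paths are placeholders, as in the manual command above):
# compare_fixed_mode /path/to/dtd /path/to/file
```

SGML_CATALOG_FILES is left out here on purpose, since the manual command above did not set it either.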
I've read the documentation about these options:
http://openjade.sourceforge.net/doc/charset.htm
but frankly I don't understand it very much.
I've done:
larjona@wolkenstein:~$ sudo -u debwww env | grep SP_
and it returns nothing, so I guess only the environment set in the "validate"
script is taken into account; if we don't set the variables there, the defaults
apply.
I've modified and run a copy of the validate script, making it print some values
when checking a file, and document type is correctly detected (HTML 4.01
Strict), as well as charset (utf-8).
I'm not sure whether I can safely comment out line 376
$ENV{'SP_CHARSET_FIXED'} = 1;
to avoid the errors, or even comment out the whole paragraph, and trust onsgmls
to do the right thing.
Can anybody with more experience in this help?
Thanks
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona