
Bug#820119: [www.debian.org] validation errors: cannot convert character reference to number X because character not in internal character set



Hello all
Now that we are using the more modern tool onsgmls instead of nsgmls in our
"validate" script:

https://anonscm.debian.org/cgit/debwww/cron.git/tree/scripts/validate

I've returned to this bug.

The output of the validate script for the files containing "emojis" didn't
change much:

**** Errors validating
        /srv/www.debian.org/www/international/l10n/po/en_GB.it.html: ***
Line 122, character 357:  cannot convert character reference to number
        128513 because character not in internal character set
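
For reference, the number in the error message is the decimal code point taken
from a numeric character reference in the page. A minimal sketch to see which
character it is and whether it lies beyond the Basic Multilingual Plane
(the helper names `codepoint` and `beyond_bmp` are made up for illustration;
whether being outside the BMP is what actually triggers the error is only an
assumption here, not something confirmed by the onsgmls documentation):

```shell
#!/bin/sh
# Hypothetical helper: print a decimal code point in U+XXXX notation.
codepoint() {
    printf 'U+%04X\n' "$1"
}

# Hypothetical helper: succeed if the code point is above U+FFFF,
# i.e. outside the Basic Multilingual Plane.
beyond_bmp() {
    [ "$1" -gt 65535 ]
}

codepoint 128513          # prints U+1F601
if beyond_bmp 128513; then
    echo "128513 is outside the BMP"
fi
```

U+1F601 is one of the emoji code points, which matches the files that fail.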

I was a bit surprised that we are still getting these errors, because if I run
the file through the online W3C validator https://validator.w3.org/ or even run
onsgmls manually on the machine that builds the website:

onsgmls -E0 -s /path/to/dtd /path/to/file

in both cases I don't get any error.
So I've looked at the "validate" script and played a bit with the options set
there, and I'd like to draw your attention to lines 363-376:

    # Determine whether we're dealing with HTML or XHTML and set the SP
    # environment accordingly.
    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_ENCODING'} = $charset;
        } else {
            $ENV{'SP_ENCODING'} = "ISO-8859-1";
        }
    }
    $ENV{'SP_CHARSET_FIXED'} = 1;

If I comment out this last line (thus letting onsgmls run in non-fixed mode), I
get no errors validating the file.
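
The comparison can also be reproduced from the shell without editing the
script. This is only a sketch: `validate_once` is a hypothetical helper, the
encoding is hard-coded to utf-8 (the charset detected for this file), and it
assumes SGML_CATALOG_FILES is already exported to the HTML catalog that the
"validate" script uses.

```shell
#!/bin/sh
# Hypothetical helper: run onsgmls on one file with SP_CHARSET_FIXED
# either set to 1 ("fixed") or unset (any other mode value).
# Assumes SGML_CATALOG_FILES is already exported by the caller.
validate_once() {
    mode="$1"; decl="$2"; file="$3"
    (
        # The subshell keeps these settings from leaking into the caller.
        export SP_ENCODING=utf-8
        if [ "$mode" = "fixed" ]; then
            export SP_CHARSET_FIXED=1
        else
            unset SP_CHARSET_FIXED
        fi
        onsgmls -E0 -s "$decl" "$file"
    )
}

# Usage, with the same placeholder paths as the manual command above:
#   validate_once fixed /path/to/dtd /path/to/file
#   validate_once free  /path/to/dtd /path/to/file
```

If only the "fixed" run reports the character reference error, that points at
SP_CHARSET_FIXED rather than at the catalog or the encoding.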

I've read the documentation about these options:

http://openjade.sourceforge.net/doc/charset.htm

but frankly I don't understand it very well.

I've done:

larjona@wolkenstein:~$ sudo -u debwww env | grep SP_

and it returns nothing, so I guess only the environment set in the "validate"
script is taken into account; if we don't set the variables there, the defaults
apply.

I've modified and run a copy of the validate script, making it print some values
when checking a file, and document type is correctly detected (HTML 4.01
Strict), as well as charset (utf-8).

I'm not sure whether I can safely comment out line 376

    $ENV{'SP_CHARSET_FIXED'} = 1;

to avoid the errors, or even comment out the whole block, and trust onsgmls to
do the right thing.

Can anybody with more experience in this help?

Thanks
--
Laura Arjona Reina
https://wiki.debian.org/LauraArjona

