Bug#820119: tidy reports valid NCR as invalid
Laura asked for my help on this issue. What I found is that setting the
environment variable SP_CHARSET_FIXED to 1 makes the onsgmls program use
the Unicode 2.0 character set, as the referenced web page says.
However, it then covers only the first 65536 characters (the
iso10646-ucs-2 character set), so character number 128513 (U+1F601)
triggers the error because it is outside that range. For that character
to validate, SP_CHARSET_FIXED must be unset when the validate script
processes HTML files. XML files, however, still need SP_CHARSET_FIXED
set. So I suggest something like this (patch attached):
    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_CHARSET_FIXED'} = 1;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_BCTF'} = $charset;
        } else {
            $ENV{'SP_BCTF'} = "utf-8";
        }
    }
That also changes the default character set for HTML from ISO-8859-1 to
UTF-8 because the former is not a valid BCTF option. It appears the
validate script only uses that default when the HTML file itself does
not declare a character set and no character set option is passed to
the script.
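
For reference, the range problem described earlier can be checked with a
few lines of Perl. This is only an illustration of the arithmetic, not
something onsgmls or the validate script does; the helper name
fits_in_ucs2 is made up for this example.

```perl
use strict;
use warnings;

# Highest code point representable in iso10646-ucs-2 (16 bits).
use constant UCS2_MAX => 0xFFFF;

# Hypothetical helper: returns 1 if the code point fits in 16 bits, else 0.
sub fits_in_ucs2 {
    my ($code_point) = @_;
    return $code_point <= UCS2_MAX ? 1 : 0;
}

my $ncr = 128513;    # the character number from the report
printf "U+%04X fits in UCS-2: %s\n", $ncr, fits_in_ucs2($ncr) ? 'yes' : 'no';
# prints "U+1F601 fits in UCS-2: no"
```

Any character number above 65535 fails the same check, so the problem is
not specific to this one character.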
I didn't set up the whole web site build on my machine to test whether
this change has any negative effects on pages other than en_GB.it.html,
so it needs broader testing.
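
One thing broader testing could check: %ENV persists across iterations
of the foreach loop, so if a single run handles both XHTML and HTML
files, a variable set for one file would still be set for the next one.
Below is a minimal sketch of one way to guard against that, assuming
such mixed runs can happen. The helper name set_sp_environment and the
catalog file names are hypothetical and not part of scripts/validate,
and whether SP_ENCODING and SP_BCTF also need clearing is an assumption
on my part.

```perl
use strict;
use warnings;

# Hypothetical helper (not in scripts/validate): set the SP_* variables
# for one file and clear the ones the other branch may have left behind.
sub set_sp_environment {
    my ($is_xhtml, $catalog, $charset) = @_;

    $ENV{'SGML_CATALOG_FILES'} = $catalog;
    if ($is_xhtml) {
        # XML needs the fixed character set; drop any BCTF from an
        # earlier HTML file (assumption: it should not leak through).
        delete $ENV{'SP_BCTF'};
        $ENV{'SP_CHARSET_FIXED'} = 1;
        $ENV{'SP_ENCODING'}      = 'xml';
    } else {
        # HTML must run with SP_CHARSET_FIXED unset, so remove it
        # explicitly instead of relying on it never having been set.
        delete $ENV{'SP_CHARSET_FIXED'};
        delete $ENV{'SP_ENCODING'};
        $ENV{'SP_BCTF'} = defined $charset ? $charset : 'utf-8';
    }
    return;
}

# Example: an XHTML file followed by an HTML file in the same run.
set_sp_environment(1, 'xhtml.cat', undef);
set_sp_environment(0, 'html.cat',  undef);
print exists $ENV{'SP_CHARSET_FIXED'} ? "still set\n" : "unset\n";
```

With the plain if/else from the patch, the same sequence would leave
SP_CHARSET_FIXED set to 1 for the HTML file.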
diff --git a/scripts/validate b/scripts/validate
index 7d20f1c..a41c1cb 100755
--- a/scripts/validate
+++ b/scripts/validate
@@ -364,16 +364,16 @@ foreach $file (@files) {
     # environment accordingly.
     if ($xhtml{$htmlLevel}) {
         $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
+        $ENV{'SP_CHARSET_FIXED'} = 1;
         $ENV{'SP_ENCODING'} = 'xml';
     } else {
         $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
         if (defined $charset) {
-            $ENV{'SP_ENCODING'} = $charset;
+            $ENV{'SP_BCTF'} = $charset;
         } else {
-            $ENV{'SP_ENCODING'} = "ISO-8859-1";
+            $ENV{'SP_BCTF'} = "utf-8";
         }
     }
-    $ENV{'SP_CHARSET_FIXED'} = 1;
     if ($verbose) {
         if ($file eq '-') {