
Bug#820119: restores original characters instead of taking care of every time numeric references coming up



Hi again.

I think my conclusion was silly: I was only considering encoding the whole string.
But we can encode the &#000130; and leave the emoji as a numeric entity.
Victory is right; I'll try to think more clearly later and merge the patch today. (AFK now, sorry.)
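
A minimal sketch of that idea, assuming it means "decode low code points to UTF-8 and leave higher ones as references". The helper name decode_bmp_refs and the 0xFFFF cutoff are hypothetical and not part of victory's patch, and whether tidy and validate accept the result is not tested here:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not in gen-files.pl): replace decimal numeric
# character references up to U+FFFF with their UTF-8 bytes and leave
# anything above that cutoff (e.g. emoji) as a numeric reference.
# The cutoff is an assumption chosen for illustration only.
sub decode_bmp_refs {
    my ($name) = @_;
    $name =~ s/&#(\d+);/$1 <= 0xFFFF ? encode("UTF-8", chr($1)) : "&#$1;"/ge;
    return $name;
}

# Example call: the emoji reference is returned unchanged.
print decode_bmp_refs("Jos&#233; &#128513;"), "\n";
```

Like the patch, this only handles decimal references; hexadecimal ones such as &#x82; would need a second substitution.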


On 14 January 2017 13:43:24 CET, Laura Arjona Reina <larjona@debian.org> wrote:
>Hi
>
>On 13/01/17 at 11:34, victory wrote:
>> 
>> first, it is stupid to complain about names which are valid.
>> it is also stupid to handle each occurrence as it comes up.
>> as pages are all utf-8 now, there is no need to keep such references;
>> this patch restores the original characters instead of numeric references
>> 
>> patch below:
>> Index: english/international/l10n/scripts/gen-files.pl
>> ===================================================================
>> --- english/international/l10n/scripts/gen-files.pl	(revision 232)
>> +++ english/international/l10n/scripts/gen-files.pl	(working copy)
>> @@ -3,6 +3,7 @@
>>  use strict;
>>  use File::Path;
>>  use Getopt::Long;
>> +use Encode qw(encode);
>>  
>>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>>  
>> @@ -117,8 +118,7 @@
>>          $name =~ s/\s*<.*//;
>>          $name =~ s/&(?!#)/&amp;/g;
>>          $name =~ s/=\?.*?\?=//g;
>> -        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
>> -        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
>> +        $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>>          $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>>          $name = '' if $name =~ m/\@/;
>>          return $name;
>> 
>> 
>
>Thanks for all the work in these and other validation/tidy issues in
>the website.
>
>I've done some tests and I'm afraid I cannot merge the patch yet.
>
>Using perl to encode to UTF-8 as you propose makes tidy happy, but
>there is another script run on the files, "validate", that produces
>this error:
>
>Line 10, character 12:  non SGML character number 130
>
>If we use numeric entities, tidy complains about &#000130; unless we
>strip the character as we do now.
>
>For the emoji in the translator's name, "validate" complains in any case:
>
>* Using numeric entities, we get the message we currently receive:
>
>"128513" is not a character number in the document character set
>
>* Encoding to UTF-8 as in the proposed patch:
>
>Line 10, character 29:  non SGML character number 65533
>
>I've produced two small files:
>
>https://cosas.larjona.net/validate.utf8.html
>https://cosas.larjona.net/validate.ncr.html
>
>and passed them through the online validator at https://validator.w3.org/
>
>I'll try to see if we can use https://validator.w3.org/source/ and get
>better "tidy" and "validate" tools from there.
>
>For now, I've fixed the comment in the gen-files.pl:
>
>--- english/international/l10n/scripts/gen-files.pl     20 May 2016 21:15:45 -0000      1.97
>+++ english/international/l10n/scripts/gen-files.pl     14 Jan 2017 12:41:06 -0000
>@@ -117,7 +117,10 @@
>         $name =~ s/\s*<.*//;
>         $name =~ s/&(?!#)/&amp;/g;
>         $name =~ s/=\?.*?\?=//g;
>-        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
>+        # BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
>+        # but the "tidy" tool that we use complains about them,
>+        # so we just remove those characters for now, until better solution
>+        # see Bug #820119
>         $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
>         $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>         $name = '' if $name =~ m/\@/;
>
>Best regards

Laura Arjona Reina
https://wiki.debian.org/LauraArjona

