
Bug#820119: restores original characters instead of taking care of every time numeric references coming up



Hi

On 13/01/17 at 11:34, victory wrote:
> 
> first, it is stupid to blame about names which are valid.
> it is also stupid that taking care of each occurrences coming up.
> as pages are all utf-8 now, no need to keep such references,
> this patch restores original characters instead of numeric references
> 
> patch below:
> Index: english/international/l10n/scripts/gen-files.pl
> ===================================================================
> --- english/international/l10n/scripts/gen-files.pl	(revision 232)
> +++ english/international/l10n/scripts/gen-files.pl	(working copy)
> @@ -3,6 +3,7 @@
>  use strict;
>  use File::Path;
>  use Getopt::Long;
> +use Encode qw(encode);
>  
>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>  
> @@ -117,8 +118,7 @@
>          $name =~ s/\s*<.*//;
>          $name =~ s/&(?!#)/&amp;/g;
>          $name =~ s/=\?.*?\?=//g;
> -        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> -        $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> +        $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>          $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>          $name = '' if $name =~ m/\@/;
>          return $name;
> 
> 

Thanks for all the work on these and other validation/tidy issues on
the website.

I've done some tests and I'm afraid I cannot merge the patch yet.

Using Perl to encode to UTF-8 as you propose makes tidy happy, but
there is another script that is run on the files, "validate", and it
produces these errors:

Line 10, character 12:  non SGML character number 130

If we use numeric entities instead, tidy complains about &#000130 unless
we strip the character, as we do now.
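
As a minimal sketch of the difference between the two substitutions
discussed above: the helper names and the sample name below are made up
for illustration and are not part of gen-files.pl; only the two regular
expressions are taken from the script and from the proposed patch.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helpers, for illustration only; not part of gen-files.pl.

# Current behaviour: drop U+0082 whether it arrives as a decimal
# reference, a hex reference, or the literal character.
sub strip_bph {
    my ($name) = @_;
    $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
    return $name;
}

# Proposed patch: replace every decimal reference with the UTF-8
# encoded bytes of that code point.
sub ncr_to_utf8 {
    my ($name) = @_;
    $name =~ s/&#(\d+);/encode("UTF-8", chr($1))/ge;
    return $name;
}

my $sample = "Jos&#233; &#130;Doe";    # made-up translator name

print strip_bph($sample), "\n";        # Jos&#233; Doe

# Show the bytes produced by the patch as hex, so that the C2 82
# sequence for U+0082 is visible.
print unpack("H*", ncr_to_utf8($sample)), "\n";
```

With the proposed substitution, U+0082 is still present in the output
(as the bytes C2 82), which is the character that "validate" reports as
number 130.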

For the emoji in a translator's name, "validate" complains in either case:

* Using numeric entities, we get the message we currently receive:

"128513" is not a character number in the document character set

* Encoding to UTF-8 as in the proposed patch:

Line 10, character 29:  non SGML character number 65533
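
To find which names are affected, a small check like the following
could list the decimal references that fall into the two problem cases
seen so far. This is only a sketch: the helper name is hypothetical, and
the ranges (128-159, and anything above 65535) are an assumption
generalised from the two examples 130 and 128513, not something taken
from the "validate" tool itself.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: return the decimal
# references in $name whose code point lies in an assumed problem range.
sub suspicious_refs {
    my ($name) = @_;
    my @found;
    while ($name =~ /&#(\d+);/g) {
        my $cp = $1 + 0;    # numify, so leading zeros are ignored
        push @found, $cp
            if ($cp >= 128 && $cp <= 159)    # assumed: same range as 130
            || $cp > 0xFFFF;                 # assumed: same case as 128513
    }
    return @found;
}

my @refs = suspicious_refs("Jos&#233; &#000130;Doe &#128513;");
print join(", ", @refs), "\n";    # 130, 128513
```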

I've produced two small files:

https://cosas.larjona.net/validate.utf8.html
https://cosas.larjona.net/validate.ncr.html

and passed the online validator in https://validator.w3.org/

I'll try to see if we can use https://validator.w3.org/source/ and get
better "tidy" and "validate" tools from there.

For now, I've fixed the comment in the gen-files.pl:

--- english/international/l10n/scripts/gen-files.pl     20 May 2016 21:15:45 -0000      1.97
+++ english/international/l10n/scripts/gen-files.pl     14 Jan 2017 12:41:06 -0000
@@ -117,7 +117,10 @@
         $name =~ s/\s*<.*//;
         $name =~ s/&(?!#)/&amp;/g;
         $name =~ s/=\?.*?\?=//g;
-        # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
+        # BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
+        # but the "tidy" tool that we use complains about them,
+        # so we just remove those characters for now, until better solution
+        # see Bug #820119
         $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
         $name = 'DDTP' if $name eq 'Debian Description Translation Project';
         $name = '' if $name =~ m/\@/;

Best regards
-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona

