Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

To: Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, 959761@bugs.debian.org, perl@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
From: Damyan Ivanov <dmn@debian.org>
Date: Tue, 5 May 2020 08:45:11 +0300
Message-id: <20200505054510.ndppc4gxea5iwgi7@fbd7c150-3361-11e8-8c11-5badabdd4a8d>
In-reply-to: <20200505013426.hf4e2za5xqomz4af@sym.noone.org>
References: <15d8a46a-6264-2bb7-d952-f4deaa7a38ef@debian.org> <15d8a46a-6264-2bb7-d952-f4deaa7a38ef@debian.org> <20200503225739.5484cb41fc877994ebb89ce5@mailbox.org> <295462cf3518b51c2c8b1f516934033748b5159c.camel@debian.org> <20200505010058.tqoss44lmgy5jneh@sym.noone.org> <20200505013426.hf4e2za5xqomz4af@sym.noone.org>

(not a Perl maintainer here)

-=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> 包
> → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> �
> 
> Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> perl package (not the whole Debian Perl Team), maybe they have some
> insight what actually goes wrong here and if that's indeed a Perl 
> bug.

Seems like a user (wml) bug to me (improper handling of UTF-8 encoded data):

→ echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
包赠传阅加者

From perlrun(1):

      -C [number/list]
            The -C flag controls some of the Perl Unicode features.

            As of 5.8.1, the -C can be followed either by a number or a list
            of option letters.  The letters, their numeric values, and effects
            are as follows; listing the letters is equal to summing the
            numbers.

                I     1   STDIN is assumed to be in UTF-8
                O     2   STDOUT will be in UTF-8
                E     4   STDERR will be in UTF-8
                S     7   I + O + E

Perhaps the strings in wml need to be decoded from UTF-8 so that they 
aren't treated as a sequence of independent bytes?

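U+0085 is "Next line (NEL)", which seems to be treated as whitespace
by \s. The UTF-8 encoding of 包 (U+5305) is the bytes E5 8C 85, so on
undecoded input \s+\n matches the trailing 0x85 together with the
newline and strips it, leaving a truncated byte sequence.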


(
Strangely, replacing -CS with a call to STDIN->binmode("UTF-8") 
doesn't help:

 echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
 �

Explicitly using Encode helps:

 echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
 Wide character in print at -e line 1, <> line 1.
 包

(the wide character warning is expected, because STDOUT has not been told how to encode Unicode characters)
)

-- dam
