Re: Bug#959761: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
- To: Damyan Ivanov <dmn@debian.org>, 959761@bugs.debian.org
- Cc: Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, perl@packages.debian.org
- Subject: Re: Bug#959761: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
- From: Axel Beckert <abe@debian.org>
- Date: Tue, 5 May 2020 10:53:29 +0200
- Message-id: <20200505085328.6jqtzcaxkluhhl6e@sym.noone.org>
- Mail-followup-to: Damyan Ivanov <dmn@debian.org>, 959761@bugs.debian.org, Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, perl@packages.debian.org
- In-reply-to: <20200505054510.ndppc4gxea5iwgi7@fbd7c150-3361-11e8-8c11-5badabdd4a8d>
- References: <15d8a46a-6264-2bb7-d952-f4deaa7a38ef@debian.org> <15d8a46a-6264-2bb7-d952-f4deaa7a38ef@debian.org> <20200503225739.5484cb41fc877994ebb89ce5@mailbox.org> <295462cf3518b51c2c8b1f516934033748b5159c.camel@debian.org> <20200505010058.tqoss44lmgy5jneh@sym.noone.org> <20200505013426.hf4e2za5xqomz4af@sym.noone.org> <15d8a46a-6264-2bb7-d952-f4deaa7a38ef@debian.org> <20200505054510.ndppc4gxea5iwgi7@fbd7c150-3361-11e8-8c11-5badabdd4a8d>
Hi Damyan,
Damyan Ivanov wrote:
> (not a Perl maintainer here)
It did help nevertheless. I just didn't want to spam the whole Perl Team
with potential Perl bugs. ;-)
> -=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> > → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> > 包
> > → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> > �
> >
> > Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> > perl package (not the whole Debian Perl Team), maybe they have some
> > insight what actually goes wrong here and if that's indeed a Perl
> > bug.
>
> Seems like a user (wml) bug to me (improper handling of UTF-8 encoded data):
>
> → echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> 包赠传阅加者
>
> From perlrun(1):
>
> -C [number/list]
> The -C flag controls some of the Perl Unicode features.
>
> As of 5.8.1, the -C can be followed either by a number or a list
> of option letters. The letters, their numeric values, and effects
> are as follows; listing the letters is equal to summing the
> numbers.
>
> I 1 STDIN is assumed to be in UTF-8
> O 2 STDOUT will be in UTF-8
> E 4 STDERR will be in UTF-8
> S 7 I + O + E
Thanks! I was not aware of the -C option...
> Perhaps the strings in wml need to be decoded from UTF-8 so that they
> aren't treated as a sequence of independent bytes?
... and would have expected "use feature unicode_strings;" to already
activate all of this.
> U+0085 is "Next line (NEL)", which seems to be treated as "\n".
I see.
> Strangely, replacing -CS with a call to STDIN->binmode("UTF-8")
> doesn't help:
>
> echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
> �
>
> Explicitly using Encode helps:
>
> echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
> Wide character in print at -e line 1, <> line 1.
> 包
Thanks, will try to use whatever works from these.
Regards, Axel
--
,''`. | Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' : | Debian Developer, ftp.ch.debian.org Admin
`. `' | 4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5
`- | 1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE