
Bug#959474: Bug#959761: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster



Hi Damyan,

Damyan Ivanov wrote:
> (not a Perl maintainer here)

Did help nevertheless. Just didn't want to spam the whole Perl Team
with potential Perl bugs. ;-)

> -=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> > → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> > 包
> > → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> > �
> > 
> > Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> > perl package (not the whole Debian Perl Team), maybe they have some
> > insight what actually goes wrong here and if that's indeed a Perl 
> > bug.
> 
> Seems like a user (wml) bug to me (improper handling of UTF-8 encoded data):
> 
> → echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> 包赠传阅加者
> 
> From perlrun(1):
> 
>       -C [number/list]
>             The -C flag controls some of the Perl Unicode features.
> 
>             As of 5.8.1, the -C can be followed either by a number or a list
>             of option letters.  The letters, their numeric values, and effects
>             are as follows; listing the letters is equal to summing the
>             numbers.
> 
>                 I     1   STDIN is assumed to be in UTF-8
>                 O     2   STDOUT will be in UTF-8
>                 E     4   STDERR will be in UTF-8
>                 S     7   I + O + E

Thanks! I was not aware of the -C option...

> Perhaps the strings in wml need to be decoded from UTF-8 so that they 
> aren't treated as a sequence of independent bytes?

... and would have expected "use feature unicode_strings;" to already
activate all of this.

> U+0085 is "Next line (NEL)", which seems to be treated as "\n".

I see.

> Strangely, replacing -CS with a call to STDIN->binmode("UTF-8") 
> doesn't help:
> 
>  echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
>  �
> 
> Explicitly using Encode helps:
> 
>  echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
>  Wide character in print at -e line 1, <> line 1.
>  包

Thanks, will try to use whatever works from these.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE

