
Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Hi all,

(with my Debian Chinese Team hat on)

(see bottom...)

On Sunday, 2020-05-03 at 22:57 +0200, Holger Wansing wrote:
> Hi,
> Laura Arjona Reina <larjona@debian.org> wrote:
> > There are some issues with some Chinese pages when they are built in a
> > buster machine.
> > We need to fix those issues (at least the "Malformed UTF-8 character
> > [...] at ../../bin/tocn.pl [...]" ones) so DSA can upgrade the
> > www-master machine to buster. See the summary of the log at the bottom
> > to know which files produce this error.
> > I have no idea how to fix the issues, so any help from the Chinese
> > team or web team mates is greatly appreciated.
> > Additional issues may arise (e.g. I still didn't test the release-notes
> > or doc-manual), any help testing is welcome too, please create bug
> > reports for each different issue or update the existing ones. Thanks!
> > 
> > 
> > I've done a test build of the /english and /chinese subdirs in a buster
> > machine, and I have noticed some warnings/errors related to the Chinese
> > pages (some, not all of them).
> > 
> > It would be desirable to upgrade www-master machine to buster as soon as
> > possible, so any help with this (from website or Chinese team members)
> > is much appreciated.
> > 
> > Below you can find an extract of the build log, including only the
> > files for which I got some error or warning message.
> > 
> > After the build, I have compared the problematic HTML files of a build
> > in stretch and a build in buster with a diff tool, to see if there were
> > significant changes in the html output due to these issues.
> > 
> > Here are my results:
> > 
> > * For the messages of the type ", [zh_TW]Invalid UTF8: " when building,
> > I couldn't note any difference between the output of a stretch build and
> > the output of a buster build.
> > 
> > I would say this is not a blocker for the buster upgrade of www-master.
> I don't know what I did differently from Laura, but here some of the built
> HTML files with "Invalid UTF8: ... " messages are missing much of their
> content compared to the versions currently on www-master.
> So maybe they are also serious.
> > * For the messages of the type "Malformed UTF-8 character [...] at
> > ../../bin/tocn.pl [...]" I have seen important changes in the HTML diff,
> > I think the output in the stretch build is totally broken (fortunately,
> > there are not many files in that situation).
> > 
> > I would say this is a blocker for the buster upgrade of www-master, but
> > I would prefer somebody of the Chinese team to confirm (try to build
> > those files in a buster machine, and review the output).
> Maybe someone from the Chinese team can solve this, but if not, I want
> to propose a possible (temporary) solution:
> If I delete the files below from the webwml/chinese tree, I can build
> chinese without any errors. So we could probably go with a workaround like
> this:
> delete these files to get these upgrade blockers out of the way, upgrade
> wolkenstein to buster, and then try to re-add the files step by step, maybe
> with some modifications at some point, to get the original situation back.

Thanks for raising this issue. These build errors might have multiple causes,
but I have narrowed one of them down to a (possible) regression in wml. Let's
fix this issue first before talking about the others.

$ wml --version
This is WML Version 2.12.2
Copyright (c) 1996-2001 Ralf S. Engelschall.
Copyright (c) 1999-2001 Denis Barbier.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
GNU General Public License for more details.
$ cat /etc/issue
Debian GNU/Linux bullseye/sid \n \l

$ cat a.wml
<p>
包
</p>
$ hexdump -C a.wml
00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
$ wml a.wml > test.txt
$ cat test.txt
$ hexdump -C test.txt
00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|


The single character in the a.wml above is U+5305 [1], namely "CJK Unified
Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
"0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
and the "0x85" was dropped. That's surely a regression.
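
For anyone who wants to double-check the byte-level claim without wml
installed, here is a minimal Python sketch. It only re-uses the bytes from the
two hexdumps above; the helper name is_valid_utf8 is my own and not part of
wml or webwml. It shows that U+5305 encodes to three bytes and that the wml
output, with the last byte missing, is no longer valid UTF-8.

```python
# Bytes copied from the two hexdump outputs above.
original = bytes.fromhex("3c703e0ae58c850a3c2f703e0a")  # a.wml
output = bytes.fromhex("3c703e0ae58c0a3c2f703e0a")      # test.txt

# U+5305 encodes to a three-byte UTF-8 sequence.
encoded = "\u5305".encode("utf-8")
print(" ".join(f"{b:02x}" for b in encoded))


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The input file is well-formed; the wml output is not, because the
# lead byte 0xE5 announces a three-byte sequence but only one
# continuation byte (0x8C) follows before the newline.
print(is_valid_utf8(original))
print(is_valid_utf8(output))
```

A truncated sequence like this would be consistent with the "Malformed UTF-8
character" messages from tocn.pl, although I have not verified that this is
the only cause.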

I am using Debian Unstable but similar things also happen in Buster.

I have cc-ed the wml maintainer in Debian. Axel, is there any chance of fixing
this regression in both sid/testing and stable?

Boyuan Yang

[1] https://www.compart.com/en/unicode/U+5305
