
Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Control: clone -1 -2
Control: reassign -2 wml 2.12.2~ds1-2
Control: retitle -2 wml: Regression in "htmlstrip -O2" (default) with Chinese language


Boyuan Yang wrote:
> Thanks for raising this issue.

Thanks from me, too. I wasn't aware of such a regression, sorry.

> These build errors might have multiple causes,
> but I stripped the issue down to a (possible) regression of wml. Let's fix
> this issue first before talking about others.
> =======================================
> $ wml --version
> This is WML Version 2.12.2
> Copyright (c) 1996-2001 Ralf S. Engelschall.
> Copyright (c) 1999-2001 Denis Barbier.
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> GNU General Public License for more details.
> $ cat /etc/issue
> Debian GNU/Linux bullseye/sid \n \l
> $ cat a.wml
> <p>
> 包
> </p>
> $ hexdump -C a.wml
> 00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
> 0000000d
> $ wml a.wml > test.txt
> $ cat test.txt
> <p>
> �
> </p>
> $ hexdump -C test.txt
> 00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|
> 0000000c
> $ 
> I am using Debian Unstable but similar things also happen in Buster.

Can confirm that this is a regression between Stretch and Buster. :-(

> The single character in the a.wml above is U+5305 [1], namely "CJK Unified
> Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
> "0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
> and the "0x85" was dropped. That's surely a regression.
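
The byte arithmetic can be checked without wml. This is a minimal sketch, assuming a POSIX shell; expected_len and byte_count are hypothetical helper names. The lead byte 0xE5 announces a three-byte UTF-8 sequence, so the two bytes left in test.txt cannot form a valid character.

```shell
# expected_len LEAD: number of bytes a UTF-8 sequence starting with the
# byte value LEAD must have (0 for a continuation byte or invalid lead).
expected_len() {
    lead=$1
    if   [ $((lead & 0x80)) -eq 0 ];         then echo 1
    elif [ $((lead & 0xE0)) -eq $((0xC0)) ]; then echo 2
    elif [ $((lead & 0xF0)) -eq $((0xE0)) ]; then echo 3
    elif [ $((lead & 0xF8)) -eq $((0xF0)) ]; then echo 4
    else echo 0
    fi
}

# byte_count: count the bytes arriving on stdin.
byte_count() { wc -c | tr -d ' '; }

# The bytes are written as octal escapes so that any POSIX printf works:
# \345 \214 \205 is 0xE5 0x8C 0x85.
need=$(expected_len $((0xE5)))
have_in=$(printf '\345\214\205' | byte_count)
have_out=$(printf '\345\214' | byte_count)
echo "lead 0xE5 needs $need bytes; a.wml has $have_in, test.txt has $have_out"
```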

Ack. Figured out that it's pass 8 of 9 passes in WML:

→ cat a.wml | wml -p1-8
→ cat a.wml | wml -p1-7
→ cat a.wml | wml -p1-7,9
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip
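
The same narrowing-down could be scripted roughly as follows. This is only a sketch, assuming a POSIX shell: survives is a hypothetical helper, the wml options are the ones listed above, and the loop is skipped when wml is not installed.

```shell
# survives CMD [ARGS...]: feed 包 (UTF-8 bytes e5 8c 85) plus a newline to
# CMD and succeed only if those three bytes are still adjacent in its output.
survives() {
    printf '\345\214\205\n' | "$@" 2>/dev/null \
        | od -An -v -tx1 | tr -s ' \n' ' ' | grep -q 'e5 8c 85'
}

if command -v wml >/dev/null 2>&1; then
    for passes in 1-7 1-8 1-7,9; do
        if survives wml "-p$passes"; then
            echo "-p$passes: ok"
        else
            echo "-p$passes: character damaged"
        fi
    done
fi
```

Any filter that reads stdin can be checked the same way, for example "survives cat".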

Pass 8 is htmlstrip, something similar to uglifyjs, but for HTML.

Since that pass exists only to improve delivery performance and save
disk space, it can likely be left out without harm.

So I see multiple ways to more or less quickly fix this issue in the
Debian web:

* Always call wml with "-p1-7,9".
* Call wml with "-p1-7,9" if any of the affected languages is built.
* Add <nostrip>…</nostrip> containers in the header and footer
  templates for the affected languages.
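
The second option might look roughly like this in a build wrapper. It is a sketch under assumptions: wml_pass_flags and page.wml are hypothetical names, and the language codes zh-cn, zh-hk and zh-tw are my guess at the affected set, which would need checking against the actual build.

```shell
# wml_pass_flags LANG: print the extra wml options for one language build.
# The language codes below are an assumption; adjust them to the affected set.
wml_pass_flags() {
    case $1 in
        zh-cn|zh-hk|zh-tw) echo "-p1-7,9" ;;  # skip pass 8 (htmlstrip)
        *)                 echo "" ;;          # run all nine passes
    esac
}

# Show the command line that would be used for two example languages.
for lang in en zh-cn; do
    echo "$lang: wml $(wml_pass_flags "$lang") page.wml"
done
```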

To be more precise, it's the optimisation level 2 of htmlstrip:

→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 0
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2

The man page says:

       Level 2:
           Good stripping: Same as level 1 plus compression of
           multiple whitespaces (more than one in sequence) to single
           whitespaces [txt,tag] and stripping of trailing whitespaces
           at the end of a line [txt,tag,pre].
           This level is the default because while providing good
           optimization the HTML markup is not destroyed and remains
           human readable.
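
For comparison, here is a rough sketch of what that description amounts to when done byte-wise. It is not htmlstrip's implementation: strip_sketch is a hypothetical name, the [txt,tag,pre] scoping is ignored, and only spaces and tabs count as whitespace. Done this way, the three bytes of 包 pass through unchanged, which is the behaviour one would expect from level 2.

```shell
tab=$(printf '\t')

# strip_sketch: collapse runs of spaces/tabs to a single space, then drop a
# trailing space at the end of each line. LC_ALL=C makes sed work on plain
# bytes, so multi-byte UTF-8 characters are never split.
strip_sketch() {
    LC_ALL=C sed -e "s/[ $tab][ $tab]*/ /g" -e 's/ $//'
}

printf 'a  b \t \n' | strip_sketch                         # prints "a b"
printf '\345\214\205  \n' | strip_sketch | od -An -v -tx1  # bytes e5 8c 85 0a
```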

So instead of skipping htmlstrip completely, "-O1" could be passed to
wml everywhere I suggested passing "-p1-7,9", since wml hands this
option on to htmlstrip:

→ cat a.wml | wml -O1

> I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
> this regression in both Sid/Testing and Stable?

I think the above is a good first workaround on buster. With this
mail, I clone the bug report and will try to figure out what change in
htmlstrip caused the regression and/or how it can be fixed.

However, I currently have issues building more recent upstream versions
of WML, which is why wml in Unstable hasn't seen an update yet. A more
recent version is in git, but IIRC there have been another release or
two recently, which I haven't looked at yet.

		Regards, Axel
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
