
Bug#989132: buster-pu: package wml/2.12.2~ds1-3~deb10u1



Package: release.debian.org
Severity: normal
Tags: buster
User: release.debian.org@packages.debian.org
Usertags: pu
X-Debbugs-Cc: Axel Beckert <abe@debian.org>, debian-www@lists.debian.org

Hi,

(abe@debian.org is in X-Debbugs-Cc, having agreed to my helping on this
topic; debian-www@lists.debian.org is copied for information.)


[ Reason ]

The wml package in buster contains a regression from stretch that leads
to various Unicode-related fun. It can trigger Unicode validity issues
in Chinese, which were seen and worked around for the build of the
Debian website; but it can also misrender various languages if a
non-ASCII character happens to be the last one on a line in the WML
source. That includes the rather frequent word “à” in French (affecting
hundreds of pages on the Debian website), or “υ” as the last letter of
the last word on a line (seen in Greek). This was also reported for
Russian.
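
For illustration, here is a minimal Python sketch of the likely
mechanism. It assumes, as the patch description suggests, that htmlstrip
trims trailing whitespace on raw bytes and that Unicode rules make \s
match more than ASCII whitespace. The function name strip_trailing_ws is
hypothetical, and this is not wml's actual code (which is Perl). In
UTF-8, “à” is the bytes C3 A0 and “υ” is CF 85; taken as individual code
points, A0 is a no-break space and 85 is a next-line control, and both
count as whitespace under Unicode rules.

```python
import re


def strip_trailing_ws(raw: bytes, unicode_rules: bool) -> bytes:
    """Strip trailing whitespace from one line given as raw bytes.

    Hypothetical helper: it mimics a tool that treats each byte as a
    single code point, which is an assumption about the failure mode.
    """
    # latin-1 maps every byte to the code point with the same value.
    text = raw.decode("latin-1")
    # re.ASCII restricts \s to [ \t\n\r\f\v]; without it, \s also
    # matches code points such as U+0085 and U+00A0.
    flags = 0 if unicode_rules else re.ASCII
    return re.sub(r"\s+\Z", "", text, flags=flags).encode("latin-1")


if __name__ == "__main__":
    for word in ("à", "υ"):
        raw = word.encode("utf-8")
        print(word, strip_trailing_ws(raw, True), strip_trailing_ws(raw, False))
```

With Unicode rules, the second byte of the character is dropped, which
leaves an invalid UTF-8 sequence at the end of the line; with ASCII-only
rules, the bytes are left untouched.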

Patching the Debian website to avoid running into these situations would
be feasible but impractical, as new and updated translations would have
to be monitored. And that wouldn't fix rendering for unsuspecting wml
users outside the Debian website use case.

Patching wml instead was discussed in this MR against webmaster-team's
webwml, which includes some examples of bad rendering, and many more
data points down the line (which are summed up below):
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596


[ Impact ]

Broken rendering when non-ASCII characters appear at the end of a line
in WML sources. This might be non-obvious, since it doesn't break the
build.


[ Tests ]

Obviously, I've used the Debian website as a “regression test” that
encompasses many files in various languages. My findings are available
here:
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596#note_240902
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596#note_240938

Basically, `file` can be used to determine whether rendering in
generated HTML files appears to be broken, mixing UTF-8 and
ISO-something (or similar) characters. With this, I confirmed that all
files reported as “Non-ISO extended-ASCII” (in several variations)
become fully UTF-8 files (also in several variations, depending on long
lines etc.).
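
A similar check can be sketched in Python without relying on `file`.
This is only an approximation of the test described above, assuming the
generated pages are all meant to be UTF-8; the helper names
is_valid_utf8 and find_broken_html and the directory layout are
hypothetical.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_broken_html(root: str) -> list:
    """List generated HTML files under root that are not valid UTF-8."""
    return sorted(
        path
        for path in Path(root).rglob("*.html")
        if not is_valid_utf8(path.read_bytes())
    )
```

Running find_broken_html on the output tree before and after installing
the patched package should show the list of offending files shrinking,
if the assumption about UTF-8 output holds.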

I've also checked that the expected changes are happening, with a
“broken character” being replaced by “à” many, many times in French (we
have 700+ affected pages for that language alone). As expected, non-HTML
files don't appear to change much (those were inspected via diff rather
than by relying on file's output).

The corpus of generated HTML is 64466 files, which seems decent enough
as a real-life regression test…

Finally, I've checked that *only with the patched wml package*,
reverting the workaround that was put in place for Chinese doesn't break
the HTML generation again, and even gets us a better rendering than with
the workaround. More details in:
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596#note_240938


[ Risks ]

I cannot say it will not regress or slightly change the output for some
specific users/files, but I would be quite surprised to see people show
up and complain that we fixed broken rendering…


[ Checklist ]

  [x] *all* changes are documented in the d/changelog
  [x] I reviewed all changes and I approve them
  [x] attach debdiff against the package in stable
  [x] the issue is verified as fixed in unstable


[ Changes ]

The package in buster is 2.12.2~ds1-2 (through an upload to unstable
that migrated into testing). The issue was fixed in the following upload
(2.12.2~ds1-3), which happened more than a year later and contains just
a single patch. I'm proposing to backport this specific upload to
buster, hence the rather obnoxious 2.12.2~ds1-3~deb10u1 version number.
I've also considered 2.12.2~ds1-2+deb10u1, which didn't look much better
(and I'm not sure going with 2.12.2~ds1-4 for cosmetic reasons would be
reasonable).
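
To show how these candidate version numbers sort, here is a small Python
sketch of Debian-style version comparison. It is a simplified port of
the usual algorithm ("~" sorts before everything, even the end of the
string), it ignores epochs, and the function names are hypothetical;
`dpkg --compare-versions` remains the authoritative tool.

```python
def _is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def _order(ch: str) -> int:
    """Sort weight of one character; '' stands for end of string."""
    if ch == "" or _is_digit(ch):
        return 0
    if ch == "~":
        return -1
    if ch.isalpha():
        return ord(ch)
    return ord(ch) + 256


def compare_fragment(a: str, b: str) -> int:
    """Compare two upstream-version or revision strings."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Compare the non-digit runs character by character.
        while (i < len(a) and not _is_digit(a[i])) or (
            j < len(b) and not _is_digit(b[j])
        ):
            ac = _order(a[i] if i < len(a) else "")
            bc = _order(b[j] if j < len(b) else "")
            if ac != bc:
                return ac - bc
            i += 1
            j += 1
        # Compare the digit runs numerically (an empty run counts as 0).
        si, sj = i, j
        while i < len(a) and _is_digit(a[i]):
            i += 1
        while j < len(b) and _is_digit(b[j]):
            j += 1
        na, nb = int(a[si:i] or "0"), int(b[sj:j] or "0")
        if na != nb:
            return na - nb
    return 0


def compare_versions(a: str, b: str) -> int:
    """Negative if a < b, zero if equal, positive if a > b (no epochs)."""
    def split(version: str):
        upstream, sep, revision = version.rpartition("-")
        return (upstream, revision) if sep else (version, "")

    (ua, ra), (ub, rb) = split(a), split(b)
    return compare_fragment(ua, ub) or compare_fragment(ra, rb)
```

Under these rules, 2.12.2~ds1-3~deb10u1 sorts after the 2.12.2~ds1-2
currently in buster and before the 2.12.2~ds1-3 upload it is based on.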


Thanks for considering!


Cheers,
-- 
Cyril Brulebois (kibi@debian.org)            <https://debamax.com/>
D-I release manager -- Release team member -- Freelance Consultant
diff -Nru wml-2.12.2~ds1/debian/changelog wml-2.12.2~ds1/debian/changelog
--- wml-2.12.2~ds1/debian/changelog	2019-02-17 18:39:38.000000000 +0100
+++ wml-2.12.2~ds1/debian/changelog	2021-05-25 05:47:04.000000000 +0200
@@ -1,3 +1,20 @@
+wml (2.12.2~ds1-3~deb10u1) buster; urgency=medium
+
+  * Backport Unicode fix to buster, fixing rendering issues with e.g.
+    non-ASCII characters in various languages, as seen when building the
+    Debian website. Some examples include ‘υ’ in Greek and ‘à’ in French
+    when those characters are at the end of a line.
+
+ -- Cyril Brulebois <kibi@debian.org>  Tue, 25 May 2021 05:47:04 +0200
+
+wml (2.12.2~ds1-3) unstable; urgency=medium
+
+  * Add patch to fix regression in Unicode handling (especially Chinese)
+    of "htmlstrip -O2" from Stretch to Buster by adding "no feature
+    'unicode_strings'". (Closes: #959761)
+
+ -- Axel Beckert <abe@debian.org>  Tue, 05 May 2020 14:48:19 +0200
+
 wml (2.12.2~ds1-2) unstable; urgency=medium
 
   * Recommend libgd-perl: wml::des::imgbg now uses GD.pm instead of the
diff -Nru wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch
--- wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch	1970-01-01 01:00:00.000000000 +0100
+++ wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch	2021-05-25 05:46:53.000000000 +0200
@@ -0,0 +1,19 @@
+Description: Disable feature "unicode_strings" in pass 8 (htmlstrip)
+ It only works properly if file handles are set to utf8 binmode and we
+ can't expect that all input and output is UTF-8. So disable it
+ completely and go back to classic Perl ASCII-only \s meaning.
+Bug-Debian: https://bugs.debian.org/959761
+Author: Axel Beckert <abe@debian.org>
+Forwarded: no
+
+--- a/wml_include/TheWML/Backends/HtmlStrip/Main.pm
++++ b/wml_include/TheWML/Backends/HtmlStrip/Main.pm
+@@ -8,6 +8,8 @@
+ use warnings;
+ use 5.014;
+ 
++no feature qw(unicode_strings);
++
+ use parent 'TheWML::CmdLine::Base';
+ 
+ use Getopt::Long ();
diff -Nru wml-2.12.2~ds1/debian/patches/series wml-2.12.2~ds1/debian/patches/series
--- wml-2.12.2~ds1/debian/patches/series	2019-02-17 16:44:08.000000000 +0100
+++ wml-2.12.2~ds1/debian/patches/series	2021-05-25 05:46:53.000000000 +0200
@@ -6,3 +6,4 @@
 dont-use-usr-bin-env.patch
 fix-typos-found-by-lintian.patch
 fix-contrib-wml1to2-shebang-line.patch
+fix-unicode-handling-in-htmlstrip.patch
