Re: Man pages and UTF-8

To: debian-mentors@lists.debian.org
Subject: Re: Man pages and UTF-8
From: Adam Borowski <kilobyte@angband.pl>
Date: Mon, 10 Sep 2007 21:56:50 +0200
Message-id: <20070910195649.GA26767@angband.pl>
In-reply-to: <20070910180357.GO6091@riva.ucam.org>
References: <46BC35D3.2000302@cowlark.com> <20070814152527.GA10327@nekral.homelinux.net> <20070814225053.GA13437@angband.pl> <20070910180357.GO6091@riva.ucam.org>

On Mon, Sep 10, 2007 at 07:03:57PM +0100, Colin Watson wrote:
> On Wed, Aug 15, 2007 at 12:50:53AM +0200, Adam Borowski wrote:
> > (Colin, CC-ing you as I'm not sure if you're aware of this long thread,
> > and both man-db and groff are your territory...)
> 
> I wasn't aware of it, thanks. Sorry for my delay in responding.

Woh, it's great to hear from you.  I'm afraid I've been lazy too; you should
have been shown ready patches instead of hearing "that's mostly working"...

> > On Tue, Aug 14, 2007 at 05:25:27PM +0200, Nicolas François wrote: 
> > > I proposed to Colin that we work on it during Debconf, but still had no
> > > time to do it.
> > 
> > Could you tell us if anything came of it?
> 
> I think the best summary of the current status and planning is this
> policy proposal, on which I'd very much appreciate further comments,
> since people involved in this thread seem to have a good grip on the
> issues:
> 
>   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440420

I would object quite strongly to that solution, for two reasons:

1. it leaves us with ugly manpage names until the heat death of the universe

2. it's not compatible with the .rpm world, so every single manpage that
   sneaks through without being changed will be misencoded


> David Given is pretty much spot-on in:
> 
>   http://lists.debian.org/debian-mentors/2007/08/msg00308.html

It's a working implementation of the above.  Too bad, it's an
update-the-world transition :(

> > > I tested a CVS snapshot of groff
> > 
> > On the other hand, I investigated what the headgear guys produced.  I just
> > compiled the package on Debian instead of using a real Red Hat system, so
> > due to misconfiguration things may be better than I'm reporting here.
> 
> I do need to find the stomach to look at upgrading groff again, but it's
> not *necessary* (or indeed sufficient) for this. The most important bit
> to start with is really the changes to man-db.

We do need to change them both at once.  Red Hat did it in lockstep; on
second thought, it may be a better idea to make minimal changes to groff to
keep it forward-compatible with future groffs.

> The downside to not upgrading groff, as you note, is that you'll only be
> able to actually use those codepoints which appear in the legacy
> character set corresponding to the language you're using (e.g. German
> manual pages will only be able to use Unicode codepoints that are
> convertible to ISO-8859-1). This is annoying and I fully agree that it's
> a bug, but it's not urgent, and I want to get over the first phase of
> the transition before worrying about that.

The meat of Red Hat's changes to groff is:

ISO-8859-1/"nippon" -> LC_CTYPE

and then man-db converts everything into the current locale charset.  My own
tree instead hardcodes it to UTF-8 under the hood; now it seems to me that
it would probably be best to allow groff1.9-ish "-K charset", so man-db
would be able to say "-K utf-8" while other users of groff would be
unaffected (unlike Red Hat).


> > > Slowly moving files from man/ to man.UTF-8/ while still supporting the
> > > legacy encoding in man/ would be a simple transition plan.
> > 
> > I'm afraid that's not an option.  So far I found 807 UTF-8 man pages, and
> > only 21 of them were marked as such.  But fear not, it appears I've got a
> > solution working, just let me download the rest of archive to check it.
> 
> Compatibility's the thing here. You're right that there are a lot of
> pages in UTF-8 and not marked as such (there are 1308 or so in
> manpages-es alone), but that's a relatively recent phenomenon.
> Historically, and even up until a year or two ago, pages installed in
> /usr/share/man/$LL/ had a fixed encoding which man-db could rely on
> (basically ISO-8859-1 with a few exceptions which were handled specially
> by man-db, the ones under the MULTIBYTE_GROFF define). Those that have
> moved to UTF-8 without changing directory have clearly not been tested
> on Debian since they don't work, and so I have no compunction about
> codifying that breakage;

Except, it's the cleanest long-term way, and it appears it _could_ be
codified without:

> but I won't break the pages that were installed using the proper encoding
> and always worked to date.

I went through the whole archive, and there is not a single source man page
that looks like well-formed UTF-8 while actually using a legacy charset,
although we have several that are encoded twice, and thus broken in any
charset.

The broken ones are:
es/man2/mmap.2.gz
es/man7/iso_8859-2.7.gz
man8/ipsec__updown.8.gz
man8/ipsec_auto.8.gz
man8/ipsec_barf.8.gz
     ... and most of ipsec_*
it/man1/snownews.1.gz
man1/gnome-keyboard-layout.1.gz
man3/Time::Seconds.3pm.gz
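
For illustration, "encoded twice" can be spotted with a heuristic along these
lines.  This is only a sketch, not the check used to build the list above: the
function name is made up, and it assumes the extra layer came from treating
UTF-8 bytes as ISO-8859-1 and converting them to UTF-8 again.

```python
def looks_double_encoded(data: bytes) -> bool:
    """Heuristic: True if data looks like UTF-8 that was encoded to UTF-8 twice.

    Assumption: the second pass misread the bytes as ISO-8859-1.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 even once
    if text.isascii():
        return False  # pure ASCII is unaffected by double encoding
    try:
        # Undo one layer: every character must fit in a single Latin-1 byte.
        inner = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False  # contains characters above U+00FF
    try:
        inner.decode("utf-8")
    except UnicodeDecodeError:
        return False  # the inner bytes are not UTF-8, so only one layer
    return True
```

A page that legitimately contains a sequence such as "Ã" followed by "¤" would
be flagged too, so any hit still needs a human look.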


My pipeline is a hack, but it transparently supports every manpage except
the few broken ones.  If we could have UTF-8 man pages in the policy, we
would also get a guarantee that no false positives appear in the future.
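
A minimal sketch of the idea, under stated assumptions: a page that decodes
as UTF-8 is passed through untouched, and anything else is treated as the
legacy charset of its language directory and recoded.  The function name and
the charset table are illustrative only; they are not taken from man-db or
from the actual pipeline.

```python
# Illustrative table: legacy charset assumed for a few language directories.
LEGACY_CHARSETS = {"pl": "iso-8859-2", "ru": "koi8-r", "ja": "euc-jp"}
DEFAULT_LEGACY = "iso-8859-1"


def page_to_utf8(data: bytes, lang: str = "") -> bytes:
    """Return the page source as UTF-8, guessing the input encoding."""
    try:
        data.decode("utf-8")
        return data  # already well-formed UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        charset = LEGACY_CHARSETS.get(lang, DEFAULT_LEGACY)
        # Not UTF-8: assume the legacy charset for this language directory.
        return data.decode(charset, errors="replace").encode("utf-8")
```

The guess is only safe as long as no legacy-encoded page happens to be
well-formed UTF-8, which is exactly the property checked against the archive
above.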


Please take a look at http://angband.pl/deb/man/mans.enc; it lists the
encodings of all man pages in arch={i386,all} packages.  The first field is:
8: legacy encoding
U: UTF-8
A: ASCII (charset-agnostic)

[~/man]$ grep ^A: mans.enc |wc -l
53434
[~/man]$ grep ^8: mans.enc |wc -l
10546
[~/man]$ grep ^U: mans.enc |wc -l
843
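
For reference, a three-way split like the one counted above could be computed
as follows.  This is a sketch only: how mans.enc was really generated is not
described here, and the function name is made up.

```python
def classify(data: bytes) -> str:
    """Return 'A' for pure ASCII, 'U' for well-formed UTF-8, '8' otherwise."""
    if data.isascii():
        return "A"  # charset-agnostic
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "8"  # has high bytes that are not UTF-8: legacy encoding
    return "U"
```

Called on the decompressed source of each page, it yields the first field of
a mans.enc-style listing.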

> I'm also not really keen on requiring everyone who installs a UTF-8 manual
> page to declare a versioned conflict on man-db; that's a lot of arcs in
> the dependency graph.
> By contrast, moving to /usr/share/man/fr.UTF-8/ etc. for UTF-8 manual
> pages is easy to describe and understand, and it doesn't break
> compatibility.

Yet:
[~/man]$ grep ^U mans.enc |wc -l
843
[~/man]$ grep ^U mans.enc |grep '\.UTF-8'|wc -l
21

So you would leave those 822 manpages broken.

> > Due to Red Hat and probably other dists using UTF-8 already, plenty of man
> > pages are in UTF-8 when our groff still can't parse them.  Having gone
> > through 2/3 of the archive, I got 807 such pages so far.  And every single
> > one displays lovely "Ã¤" or similar instead.  That's 9% of all mans with
> > non-ASCII characters in the corpus.
> 
> These are bugs in the packages in question, and it would be
> straightforward for the maintainers to correct that even with current
> man-db just by using 'iconv -c' in debian/rules, if they'd actually
> tested those manual pages. I attempted to clarify that in the first half
> of the policy bug report above.

So you would want them to actually go back?  Please, don't.  Let's fix
man-db instead.
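
To make the trade-off concrete, here is a small sketch of both effects: what a
UTF-8 page looks like when read as ISO-8859-1, and what a lossy conversion back
to a legacy charset does.  The function names are made up, and Python's
errors="ignore" is used only as a rough stand-in for the behaviour of
'iconv -c', which omits characters it cannot convert.

```python
def misread_as_latin1(utf8_bytes: bytes) -> str:
    """Show how UTF-8 bytes appear when interpreted as ISO-8859-1."""
    return utf8_bytes.decode("iso-8859-1")


def to_legacy_dropping(text: str, charset: str = "iso-8859-1") -> bytes:
    """Convert to a legacy charset, silently dropping what does not fit."""
    return text.encode(charset, errors="ignore")


# U+00E4 (a with diaeresis) is two bytes in UTF-8, shown as two characters.
print(misread_as_latin1("\u00e4".encode("utf-8")))  # prints "Ã¤"

# U+0142 (l with stroke) does not exist in ISO-8859-1, so it is dropped.
print(to_legacy_dropping("\u0142\u00f3d\u017a"))
```

Going back to the legacy charset avoids the first problem but loses every
character outside that charset.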

> The good news is that man-db 2.5.0 is coming along very well, and I may
> well have it complete before the initial policy amendment I proposed is
> accepted; that's fine and in that event I'll supersede it with another
> proposal and go straight to the transition plan which allows people to
> start using UTF-8 manual pages properly. I issued a pre-release version
> recently. It's probably best to check it out using bzr:
> 
>   http://www.chiark.greenend.org.uk/~cjwatson/bzr/man-db/trunk/

Woh, lemme take a look.


Cheers and schtuff.
-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
