Re: Man pages and UTF-8
On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote:
> > Woh, it's great to hear from you. I'm afraid I've been lazy too, you should
> > be shown ready patches instead of hearing "that's mostly working"...
> If you do work on patches, please make sure they're against current bzr;
> there have been a lot of changes since 2.4.4.
> > > I do need to find the stomach to look at upgrading groff again, but it's
> > > not *necessary* (or indeed sufficient) for this. The most important bit
> > > to start with is really the changes to man-db.
> > We do need to change them both at once.
> No, we don't. Seriously, I understand the problem and it's not
> necessary. man-db can stick iconv pipes in wherever it likes and it's
> all fine. When we upgrade groff at some future point we can just declare
> versioned dependencies or conflicts as necessary, but it is *not*
> necessary for this transition. A basic rule of release management is
> that the more you decouple the easier it will be.
Yet if groff cannot accept any encoding other than ISO-8859-1 (plus hacks for
ja/ko/zh), you end up with data loss for anything not representable in 8859-1.
> > The meat of Red Hat changes to groff is:
> > ISO-8859-1/"nippon" -> LC_CTYPE
> > and then man-db converts everything into the current locale charset.
> (Point of information: Red Hat doesn't use man-db.)
I didn't look that far, I didn't bother with installing a whole Red Hat
system, just did:
./test-groff -man -Tutf8 <foo.7
which seems to work perfectly. After extending the upper range from U+FFFF
to U+10FFFF it looks like this: http://angband.pl/deb/man/test.png
> Thus what you're saying seems to be that Red Hat uses the ascii8 device,
> or its equivalent (ascii8 passes through any 8-bit encoding untouched,
Actually, their -Tascii8 is completely broken, they use -Tutf8 instead.
> although certain characters are still reserved for internal use by groff
> which is why it doesn't help with UTF-8). groff upstream has repeatedly
> rejected this as typographically wrong-headed; I don't want to
> perpetuate it. groff is supposed to know what the characters really are,
> not just treat them as binary data.
I fully agree. The multibyte patch for 1.8 (which Red Hat refers to
everywhere as "the Debian patches") lets groff store characters as Unicode
code points; the input/output issues are what we're trying to fix in this
thread, and properties of particular characters are an orthogonal matter.
> Obviously we have to cope with what we've got, so ascii8 is a necessary
> evil, but it is just plain wrong to use it when we don't have to.
So let's skip it?
> > My own tree instead hardcodes it to UTF-8 under the hood; now it seems
> > to me that it would probably be best to allow groff1.9-ish "-K
> > charset", so man-db would be able to say "-K utf-8" while other users
> > of groff would be unaffected (unlike Red Hat).
> None of this is immediately necessary. Leave groff alone for the moment
> and the problem is simpler. iconv pipes are good enough for the time
> being. When we do something better, it will be a proper upgrade of groff
> converging on real UTF-8 input with proper knowledge of typographical
> meanings of glyphs (as upstream are working on), not this badly-designed
Isn't reading input into a string of Unicode code points good enough for now?
It's a whole world better than operating on opaque binary strings (ascii8),
and it works well wherever RTL or combining-character support is not needed.
> > Yet:
> > [~/man]$ grep ^U mans.enc |wc -l
> > 843
> > [~/man]$ grep ^U mans.enc |grep '\.UTF-8'|wc -l
> > 21
> > So you would leave that 822 manpages broken.
> If the alternative is breaking the 10522 pages listed in your analysis
> that are ISO-8859-* but not declared as such in their directory name,
Yeah, breaking those 10522 pages would be outright wrong. But with a bit of
temporary ugliness in the pipeline we can have both working: the 10522 in
legacy charsets and the 822 that transitioned prematurely.
> > My pipeline is a hack, but it transparently supports every manpage except
> > the several broken ones. If we could have UTF-8 man in the policy, we would
> > also get a guarantee that no false positive appears in the future.
> So, last night I was thinking about this, and wanted to propose a
> compromise where we recommend in Debian policy that pages be installed
> in a directory that explicitly specifies the encoding (you might not
> like this, but it makes man-db's life a lot easier, it's much easier to
> tell how complete the transition is, and it's what the FHS says we
> should do), but for compatibility with the RPM world we transparently
> accept UTF-8 manual pages installed in /usr/share/man/$LL/ anyway.
So you would want to have the old ones put into /usr/share/man/ISO-8859-1/
(or man.8859_1) instead of /usr/share/man/? That would work, too.
I'm opposed to spelling /usr/share/man/UTF-8/ out in full on aesthetic
grounds, as the point of Unicode is to forget that something called a
"charset" ever needed to be set, but you're the one who decides here, after all.
> I do have an efficiency concern as man-db upstream, though, which is why
> I hadn't just implemented this in the obvious crude way (try iconv -f
> UTF-8, throw away the pipeline on error, try again).
Yeah, doing the whole pipeline twice would be horrendous.
> For large manual pages it's still of practical importance that the
> formatting pipeline be smooth; that is, I don't want to have to scan the
> whole page looking for non-UTF-8 characters before I can pass it to groff.
> My ideal implementation would involve a program, let's call it "manconv",
> with behaviour much like the following:
> * Reads from standard input and writes to standard output.
> * Valid options are -f ENCODING[:ENCODING...], -t ENCODING, and -c;
> these are interpreted as with iconv except that -f's argument is a
> colon-separated list of encodings to try, typically something like
> UTF-8:ISO-8859-1. Fallback is only possible if characters can be
> expected to be invalid in leading encodings.
> * The implementation would use iconv() on reasonably-sized chunks of
> data (let's say 4KB). If it encounters EILSEQ or EINVAL, it will
> throw away the current output buffer, fall back to the next encoding
> in the list, and attempt to convert the same input buffer again.
EINVAL is possible only if a multibyte sequence is cut off by the end of the
buffer, so it just means "read more input" rather than a reason to fall back.
> This would have the behaviour that output is issued smoothly, and for -f
> UTF-8:* the encoding is detected correctly provided that there is a
> non-UTF-8 character within the first 4KB of the file. I haven't tested
> this, but intuitively it seems that it should be a good compromise.
Bad news: 4KB is not enough. Often, the only 8-bit characters are a (C) sign
or something in the authors list. The first offending characters appear at
these uncompressed offsets:
67045 man1/pcal.1.gz (pcal)
72423 man1/spax.1.gz (star)
194227 man8/backuppc.8.gz (backuppc)
So we can either:
a) slurp the whole file (up to 585KB, save for wireshark-filter which is a
b) use an ugly 190KB buffer
c) bribe the backuppc maintainer to go down to 71KB
d) same with pcal and star, for a round number of 64KB
> Is this what your "hack" pipeline implements? If so, I'd love to see it;
> if not, I'm happy to implement it.
The prototype is:
pipeline_command_args (p, "perl", "-CO", "-e",
"print decode($ARGV,$_) if $@",
so it's similar. "Slurp everything into core" in C is a page of code, your
idea of a static buffer makes it simpler; and I'm not in a position to
complain that it's another hack :p
I thought about forking off to avoid a separate binary, but a separate
binary could be potentially reused by someone else.
For -c, glibc's //TRANSLIT or my translit are always better: they drop
accents and the like, and if they fail to find a valid replacement they at
least output "?" instead of silently dropping the character.
. My translit is at http://angband.pl/svn/kbtin/trunk/translit.h; unlike
glibc's, it intentionally doesn't do æ->ae: for flowing text that's worse,
but it won't break pre-formatted or character-cell text. Both glibc's and
mine are very poor substitutes for what Links can do: Links can even turn
"Дебян" or "Δεβιαν" into "Debian". But that's probably overkill here...
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.