
Re: Man pages and UTF-8



[Quotes reordered slightly to suit the flow of my reply from argument to
constructive suggestion. :-)]

On Mon, Sep 10, 2007 at 09:56:50PM +0200, Adam Borowski wrote:
> On Mon, Sep 10, 2007 at 07:03:57PM +0100, Colin Watson wrote:
> > On Wed, Aug 15, 2007 at 12:50:53AM +0200, Adam Borowski wrote:
> > > (Colin, CC-ing you as I'm not sure if you're aware of this long thread,
> > > and both man-db and groff are your territory...)
> > 
> > I wasn't aware of it, thanks. Sorry for my delay in responding.
> 
> Woh, it's great to hear from you.  I'm afraid I've been lazy too, you should
> be shown ready patches instead of hearing "that's mostly working"...

If you do work on patches, please make sure they're against current bzr;
there have been a lot of changes since 2.4.4.

> > > On Tue, Aug 14, 2007 at 05:25:27PM +0200, Nicolas François wrote: 
> > > > I proposed Colin to work on it during Debconf, but still had no time to do
> > > > it.
> > > 
> > > Could you tell us if anything was born?
> > 
> > I think the best summary of the current status and planning is this
> > policy proposal, on which I'd very much appreciate further comments,
> > since people involved in this thread seem to have a good grip on the
> > issues:
> > 
> >   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440420
> 
> I would object quite strongly to that solution, for two reasons:
> 
> 1. it leaves us with ugly manpage names until the heat death of the universe

I don't think it's ugly at all. In fact I think it's more elegant to
have the encoding specified explicitly. As somebody pointed out on
debian-policy, what happens if we decide in ten years' time that UTF-8
wasn't such a great idea after all?

The FHS says:

  Countries for which there is a well-accepted standard character code
  set may omit the <character-set> field, but it is strongly
  recommended that it be included, especially for countries with several
  competing standards.

Now, the FHS' recommendation to flatten character set names in
directories to lowercase, remove punctuation, and remove "ISO" is
obviously not based on real implementation experience (what's the use of
a character set in a directory name if not to pass it to iconv? why not
just use the canonical character set name?) and I've never proposed we
do that - I proposed changing this text a while back, and I have an
ancient thread in my inbox that I've been meaning to reply to with a
proper diff. However, the strong recommendation above is absolutely on
target. I think it's still quite clear that ISO-8859-1 and UTF-8 are
competing standards even though UTF-8 is clearly winning, and this seems
like just the kind of thing that the FHS had in mind.

> 2. it's not compatible with the .rpm world, so every single manpage that
>    sneaks through without being changed will be misencoded

This, on the other hand, I have some sympathy for (though I don't regard
it as an overriding requirement). See below.

> > I do need to find the stomach to look at upgrading groff again, but it's
> > not *necessary* (or indeed sufficient) for this. The most important bit
> > to start with is really the changes to man-db.
> 
> We do need to change them both at once.

No, we don't. Seriously, I understand the problem and it's not
necessary. man-db can stick iconv pipes in wherever it likes and it's
all fine. When we upgrade groff at some future point we can just declare
versioned dependencies or conflicts as necessary, but it is *not*
necessary for this transition. A basic rule of release management is
that the more you decouple the easier it will be.

> > The downside to not upgrading groff, as you note, is that you'll only be
> > able to actually use those codepoints which appear in the legacy
> > character set corresponding to the language you're using (e.g. German
> > manual pages will only be able to use Unicode codepoints that are
> > convertible to ISO-8859-1). This is annoying and I fully agree that it's
> > a bug, but it's not urgent, and I want to get over the first phase of
> > the transition before worrying about that.
> 
> The meat of Red Hat changes to groff is:
> 
> ISO-8859-1/"nippon" -> LC_CTYPE
> 
> and then man-db converts everything into the current locale charset.

(Point of information: Red Hat doesn't use man-db.)

Thus what you're saying seems to be that Red Hat uses the ascii8 device,
or its equivalent (ascii8 passes through any 8-bit encoding untouched,
although certain characters are still reserved for internal use by groff
which is why it doesn't help with UTF-8). groff upstream has repeatedly
rejected this as typographically wrong-headed; I don't want to
perpetuate it. groff is supposed to know what the characters really are,
not just treat them as binary data.

Obviously we have to cope with what we've got, so ascii8 is a necessary
evil, but it is just plain wrong to use it when we don't have to. groff
does have encoding rules, even though we've bent them a good deal and
even though you can sort of get away with violating them. I consider it
part of my job to improve the situation there rather than make it worse.

> My own tree instead hardcodes it to UTF-8 under the hood; now it seems
> to me that it would probably be best to allow groff1.9-ish "-K
> charset", so man-db would be able to say "-K utf-8" while other users
> of groff would be unaffected (unlike Red Hat).

None of this is immediately necessary. Leave groff alone for the moment
and the problem is simpler. iconv pipes are good enough for the time
being. When we do something better, it will be a proper upgrade of groff
converging on real UTF-8 input with proper knowledge of typographical
meanings of glyphs (as upstream are working on), not this badly-designed
hodgepodge.

> > Compatibility's the thing here. You're right that there are a lot of
> > pages in UTF-8 and not marked as such (there are 1308 or so in
> > manpages-es alone), but that's a relatively recent phenomenon.
> > Historically, and even up until a year or two ago, pages installed in
> > /usr/share/man/$LL/ had a fixed encoding which man-db could rely on
> > (basically ISO-8859-1 with a few exceptions which were handled specially
> > by man-db, the ones under the MULTIBYTE_GROFF define). Those that have
> > moved to UTF-8 without changing directory have clearly not been tested
> > on Debian since they don't work, and so I have no compunction about
> > codifying that breakage;
> 
> Except, it's the cleanest long-term way,

As the maintainer of man-db, I'm the one who has to justify upgrade
breakage to users and to other developers, and I'm the one who gets the
bugs. I acknowledge the need for a good long-term solution, but I won't
accept a solution requiring a flag-day conversion of manual pages.

Note that installing pages in a different directory means that they will
simply not be used with older versions of man-db rather than producing
misencoded garbage. I regard this as a feature.

> Please take a look at http://angband.pl/deb/man/mans.enc; it lists the
> encodings of all man pages in arch={i386,all} packages.

Thanks for this analysis.

> The first field is:
> 8: legacy encoding
> U: UTF-8
> A: ASCII (charset-agnostic)
> 
> [~/man]$ grep ^A: mans.enc |wc -l
> 53434
> [~/man]$ grep ^8: mans.enc |wc -l
> 10546
> [~/man]$ grep ^U: mans.enc |wc -l
> 843
> 
> > I'm also not really keen on requiring everyone who installs a UTF-8 manual
> > page to declare a versioned conflict on man-db; that's a lot of arcs in
> > the dependency graph.
> > By contrast, moving to /usr/share/man/fr.UTF-8/ etc. for UTF-8 manual
> > pages is easy to describe and understand, and it doesn't break
> > compatibility.
> 
> Yet:
> [~/man]$ grep ^U mans.enc |wc -l
> 843
> [~/man]$ grep ^U mans.enc |grep '\.UTF-8'|wc -l
> 21
> 
> So you would leave those 822 manpages broken.

If the alternative is breaking the 10522 pages listed in your analysis
that are ISO-8859-* but not declared as such in their directory name,
absolutely! I have no problem whatsoever with leaving something broken
that's already broken. I suspect that most of those pages are from a
small number of packages and it wouldn't take long at all to move them
to the right place.

> > > Due to Red Hat and probably other dists using UTF-8 already, plenty of man
> > > pages are in UTF-8 when our groff still can't parse them.  Having gone
> > > through 2/3 of the archive, I got 807 such pages so far.  And every single
> > > one displays lovely "ä" or similar instead.  That's 9% of all mans with
> > > non-ASCII characters in the corpus.
> > 
> > These are bugs in the packages in question, and it would be
> > straightforward for the maintainers to correct that even with current
> > man-db just by using 'iconv -c' in debian/rules, if they'd actually
> > tested those manual pages. I attempted to clarify that in the first half
> > of the policy bug report above.
> 
> So you would want them to actually go back?  Please, don't.  Let's fix
> man-db instead.

When I made the policy proposal I was much further away from completing
man-db 2.5.0, so I thought it might be better to spot-fix the relatively
few things which were broken. I'm more optimistic now.

> My pipeline is a hack, but it transparently supports every manpage except
> the several broken ones.  If we could have UTF-8 man in the policy, we would
> also get a guarantee that no false positive appears in the future.

So, last night I was thinking about this, and wanted to propose a
compromise where we recommend in Debian policy that pages be installed
in a directory that explicitly specifies the encoding (you might not
like this, but it makes man-db's life a lot easier, it's much easier to
tell how complete the transition is, and it's what the FHS says we
should do), but for compatibility with the RPM world we transparently
accept UTF-8 manual pages installed in /usr/share/man/$LL/ anyway.
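
To make the compromise concrete, here is a minimal sketch in Python of
how a page's location could be turned into an ordered list of encodings
to try. It is an illustration only, not man-db code: the function name
candidate_encodings, the LEGACY_ENCODINGS table and its handful of
entries are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical, deliberately incomplete table of per-language legacy
# encodings; a real implementation would need a fuller list.
LEGACY_ENCODINGS = {
    "de": "ISO-8859-1",
    "es": "ISO-8859-1",
    "fr": "ISO-8859-1",
    "pl": "ISO-8859-2",
    "ja": "EUC-JP",
}


def candidate_encodings(page_path, man_root="/usr/share/man"):
    """Return the encodings to try, in order, for one manual page."""
    parts = PurePosixPath(page_path).relative_to(man_root).parts
    if len(parts) < 3:
        # No locale directory (e.g. man1/ls.1): untranslated page.
        return ["UTF-8", "ISO-8859-1"]
    locale_dir = parts[0]
    lang, _, charset = locale_dir.partition(".")
    if charset:
        # The directory names its encoding explicitly, e.g. fr.UTF-8.
        return [charset]
    # Unmarked directory: accept UTF-8 transparently, then fall back to
    # the legacy encoding for the language (territory is ignored here).
    legacy = LEGACY_ENCODINGS.get(lang.split("_")[0], "ISO-8859-1")
    return ["UTF-8", legacy]
```

With this sketch, /usr/share/man/fr.UTF-8/man1/ls.1 yields only UTF-8,
while /usr/share/man/pl/man1/ls.1 yields UTF-8 followed by ISO-8859-2.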

I do have an efficiency concern as man-db upstream, though, which is why
I hadn't just implemented this in the obvious crude way (try iconv -f
UTF-8, throw away the pipeline on error, try again). For large manual
pages it's still of practical importance that the formatting pipeline be
smooth; that is, I don't want to have to scan the whole page looking for
non-UTF-8 characters before I can pass it to groff. My ideal
implementation would involve a program, let's call it "manconv", with
behaviour much like the following:

  * Reads from standard input and writes to standard output.

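  * Valid options are -f ENCODING[:ENCODING...], -t ENCODING, and -c;
    these are interpreted as with iconv except that -f's argument is a
    colon-separated list of encodings to try in order, typically
    something like UTF-8:ISO-8859-1. Fallback only works if input in a
    later encoding can be expected to contain sequences that are invalid
    in the encodings before it in the list; UTF-8 is a good leading
    candidate because most non-ASCII legacy text is not valid UTF-8,
    whereas ISO-8859-1 accepts any byte and so would never fall back.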

  * The implementation would use iconv() on reasonably-sized chunks of
    data (let's say 4KB). If it encounters EILSEQ or EINVAL, it will
    throw away the current output buffer, fall back to the next encoding
    in the list, and attempt to convert the same input buffer again.

This would have the behaviour that output is issued smoothly, and for -f
UTF-8:* the encoding is detected correctly provided that there is a
non-UTF-8 character within the first 4KB of the file. I haven't tested
this, but intuitively it seems that it should be a good compromise.
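
The chunked fallback described above can be sketched roughly as follows.
This is an untested-in-production illustration in Python rather than C,
using Python's incremental decoders in place of iconv(); the function
name manconv and its parameters are assumptions for the example, the -c
option is not modelled, and characters that the target encoding cannot
represent are not handled.

```python
import codecs


def manconv(src, dst, from_encodings, to_encoding="UTF-8", chunk_size=4096):
    """Copy src to dst, converting chunk by chunk with encoding fallback.

    src and dst are binary file objects; from_encodings is the ordered
    list of source encodings to try (the -f list).
    """
    remaining = list(from_encodings)
    if not remaining:
        raise ValueError("at least one source encoding is required")
    decoder = codecs.getincrementaldecoder(remaining.pop(0))()
    while True:
        chunk = src.read(chunk_size)
        final = not chunk
        # Bytes of an incomplete sequence held over from the last chunk;
        # they must be re-converted if this chunk forces a fallback.
        carried = decoder.getstate()[0]
        while True:
            try:
                text = decoder.decode(chunk, final)
                break
            except UnicodeDecodeError:
                if not remaining:
                    raise  # no encoding left to fall back to
                # Discard this chunk's output, switch to the next
                # encoding, and convert the same input again.
                decoder = codecs.getincrementaldecoder(remaining.pop(0))()
                chunk = carried + chunk
                carried = b""
        dst.write(text.encode(to_encoding))
        if final:
            break
```

Called as manconv(sys.stdin.buffer, sys.stdout.buffer, ["UTF-8",
"ISO-8859-1"]), it would behave as a filter. Note that output from
earlier chunks has already been written by the time a later chunk
triggers fallback, which is the trade-off accepted in exchange for
smooth output.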

Is this what your "hack" pipeline implements? If so, I'd love to see it;
if not, I'm happy to implement it.

Cheers,

-- 
Colin Watson                                       [cjwatson@debian.org]


