
Re: Man pages and UTF-8



On Wed, Sep 12, 2007 at 02:25:26AM +0200, Adam Borowski wrote:
> On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote:

> > > > I do need to find the stomach to look at upgrading groff again, but it's
> > > > not *necessary* (or indeed sufficient) for this. The most important bit
> > > > to start with is really the changes to man-db.
> > > 
> > > We do need to change them both at once.
> > 
> > No, we don't. Seriously, I understand the problem and it's not
> > necessary. man-db can stick iconv pipes in wherever it likes and it's
> > all fine. When we upgrade groff at some future point we can just declare
> > versioned dependencies or conflicts as necessary, but it is *not*
> > necessary for this transition. A basic rule of release management is
> > that the more you decouple the easier it will be.
> 
> Yet if groff cannot accept any encoding other than ISO-8859-1 (with hacks for
> ja/ko/zh), you end up with data loss for anything not representable in 8859-1.

Right, but that's the current situation anyway. I'm not saying that we
don't want to fix this eventually, just that it's easier to do it by
baby steps.

> > > The meat of Red Hat changes to groff is:
> > > 
> > > ISO-8859-1/"nippon" -> LC_CTYPE
> > > 
> > > and then man-db converts everything into the current locale charset.
> > 
> > (Point of information: Red Hat doesn't use man-db.)
> 
> I didn't look that far, I didn't bother with installing a whole Red Hat
> system, just did:
> 
> ./test-groff -man -Tutf8 <foo.7
> 
> which seems to work perfectly.  After extending the upper range from uFFFF
> to u10FFFF it works like: http://angband.pl/deb/man/test.png

OK, I hope whatever patches they have used to make UTF-8 input work
correctly have gone upstream or are from upstream; I can only go on what
upstream have told me. They may just be relying on preconv, which is
fair enough.
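
(For the record, the preconv approach amounts to something like

  preconv foo.7 | groff -man -Tutf8

i.e. preconv rewrites the input into groff's \[uXXXX] escapes, so troff
itself never has to understand UTF-8.)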

> > Obviously we have to cope with what we've got, so ascii8 is a necessary
> > evil, but it is just plain wrong to use it when we don't have to.
> 
> So let's skip it?

We're using it right now. I want to transition away from it, but I want
to do that in a separate step.

> > > My own tree instead hardcodes it to UTF-8 under the hood; now it seems
> > > to me that it would probably be best to allow groff1.9-ish "-K
> > > charset", so man-db would be able to say "-K utf-8" while other users
> > > of groff would be unaffected (unlike Red Hat).
> > 
> > None of this is immediately necessary. Leave groff alone for the moment
> > and the problem is simpler. iconv pipes are good enough for the time
> > being. When we do something better, it will be a proper upgrade of groff
> > converging on real UTF-8 input with proper knowledge of typographical
> > meanings of glyphs (as upstream are working on), not this badly-designed
> > hodgepodge.
> 
> Isn't reading input into a string of Unicode codepoints good enough for now? 
> It's a whole world better than operating on opaque binary strings (ascii8),
> and works well where RTL or combining chars support is not needed.

And it makes complete sense to do so *eventually*. I just don't want to
do it at the same time as everything else because debugging intermediate
problems will be horribly confusing, that's all.

I think the disconnect throughout this thread is that you're talking
about the desired final state whereas I'm just focusing on making this
one initial step work properly so that future steps can be done without
requiring significant coordination from other packages. These two things
don't have to be in conflict. Let me get this one step sorted out and it
will be easier to get to your desired final state from there.

> > > So you would leave that 822 manpages broken.
> > 
> > If the alternative is breaking the 10522 pages listed in your analysis
> > that are ISO-8859-* but not declared as such in their directory name,
> > absolutely!
> 
> Yeah, breaking those 10522 pages would be outright wrong.  But with a bit of
> temporary ugliness in the pipeline we can have both the 10522 pages in legacy
> charsets and the 822 prematurely transitioned ones working.

That would indeed be ideal.

> > > My pipeline is a hack, but it transparently supports every manpage except
> > > the several broken ones.  If we could have UTF-8 man in the policy, we would
> > > also get a guarantee that no false positive appears in the future.
> > 
> > So, last night I was thinking about this, and wanted to propose a
> > compromise where we recommend in Debian policy that pages be installed
> > in a directory that explicitly specifies the encoding (you might not
> > like this, but it makes man-db's life a lot easier, it's much easier to
> > tell how complete the transition is, and it's what the FHS says we
> > should do), but for compatibility with the RPM world we transparently
> > accept UTF-8 manual pages installed in /usr/share/man/$LL/ anyway.
> 
> So you would want to have the old ones put into /usr/share/man/ISO-8859-1/
> (or man.8859_1) instead of /usr/share/man/?  That would work, too.

No, because the installed locations as used right now need to continue
working. Sure, people can move to /usr/share/man/$LL.ISO-8859-1/ if they
have some reason to continue using ISO-8859-1 and yet aren't interested
in compatibility with old man-db (I can't imagine what), but that hardly
seems worth the bother. The entire point is compatibility with old
packages.
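
To make that concrete, the layout being discussed is something like this
(paths illustrative):

  /usr/share/man/fr/man1/foo.1.gz         existing location; encoding
                                          assumed or detected
  /usr/share/man/fr.UTF-8/man1/foo.1.gz   encoding declared in the
                                          directory name, per the FHS

with the first form continuing to work indefinitely.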

> I'm opposed to spelling /usr/share/man/UTF-8/ in full on aesthetic grounds,
> as the point of Unicode is to let us forget that something called a "charset",
> which needed to be set, ever existed; but it's you who decides here, after all.

Well, the exact same aesthetic consideration would apply to locale names
and yet I use LANG=en_GB.UTF-8 because en_GB is defined to be
ISO-8859-1. That naming decision was made for the same reasons as I'm
applying here. This is, I think, life.

> >   * The implementation would use iconv() on reasonably-sized chunks of
> >     data (let's say 4KB). If it encounters EILSEQ or EINVAL, it will
> >     throw away the current output buffer, fall back to the next encoding
> >     in the list, and attempt to convert the same input buffer again.
> 
> EINVAL is possible only if a sequence is cut by the end of the buffer, so
> it's ok.

Fair point; the response to EINVAL should be "read more and try again
unless there is no more to read". This is relevant if, say, you get a
top-bit-set ISO-8859-1 character at the end of a chunk but not at the
end of the input stream.
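
A minimal sketch of that EINVAL handling (names hypothetical; this isn't
the actual man-db code) is to carry the unconverted tail over to the
next read, and only treat EINVAL as a hard failure once the input is
exhausted:

  #include <errno.h>
  #include <iconv.h>
  #include <string.h>

  /* Returns 1 when the chunk converted cleanly, 0 when more input is
   * needed (the unconverted tail has been moved to the front of inbuf),
   * and -1 when we should fall back to the next candidate encoding.
   * Growing the output buffer on E2BIG is left out here; see the
   * realloc loop further down. */
  static int
  convert_chunk (iconv_t cd, char *inbuf, size_t *inlen,
                 char *outbuf, size_t outlen, int at_eof)
  {
          char *in = inbuf, *out = outbuf;
          size_t inleft = *inlen, outleft = outlen;

          if (iconv (cd, &in, &inleft, &out, &outleft) == (size_t) -1) {
                  if (errno == EINVAL && !at_eof) {
                          /* A sequence was cut by the end of the chunk:
                           * keep the tail and ask for more input. */
                          memmove (inbuf, in, inleft);
                          *inlen = inleft;
                          return 0;
                  }
                  return -1;      /* EILSEQ, or EINVAL at end of stream */
          }
          *inlen = 0;             /* caller flushes outlen - outleft bytes */
          return 1;
  }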

> > This would have the behaviour that output is issued smoothly, and for -f
> > UTF-8:* the encoding is detected correctly provided that there is a
> > non-UTF-8 character within the first 4KB of the file. I haven't tested
> > this, but intuitively it seems that it should be a good compromise.
> 
> Bad news: 4KB is not enough.  Often, 8-bit characters are used only as (C)
> or in the authors list.  The first offending characters are at uncompressed
> offsets:

Following a discussion on linux-utf8 involving groff upstream, man-db
already supports Emacs-style encoding declarations such as:

  '\" -*- coding: UTF-8 -*-

... so using that where the heuristics fail would solve the problem. IMO
this is only a convenience so it's OK for it not to be 100%.
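
Picking the tag out of the first line is simple enough; a hypothetical
sketch (not what man-db actually does internally):

  #include <stdio.h>
  #include <string.h>

  /* Return the coding named by an Emacs-style "-*- coding: ... -*-"
   * tag on the first line of fp, or NULL if there is none.  Static
   * storage, so not reentrant; good enough for a sketch. */
  static const char *
  check_coding_tag (FILE *fp)
  {
          static char coding[64];
          char line[1024], *p;

          if (!fgets (line, sizeof line, fp) || !strstr (line, "-*-"))
                  return NULL;
          p = strstr (line, "coding:");
          if (!p)
                  return NULL;
          p += strlen ("coding:");
          while (*p == ' ' || *p == '\t')
                  p++;
          if (sscanf (p, "%63s", coding) != 1)
                  return NULL;
          return coding;
  }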

>      33219 man3/Mail::Message::Field.3pm.gz
>      33226 man1/full_index.1grass.gz
>      36027 man1/mined.1.gz
>      37172 man3/Date::Pcalc.3pm.gz
>      39127 man1/SWISH-FAQ.1.gz
>      40214 man3/Event.3pm.gz  
>      41114 man3/Class::Std.3pm.gz
>      42997 man3/SoQtViewer.3.gz  
>      47367 man3/Net::SSLeay.3pm.gz
>      53003 man1/SWISH-CONFIG.1.gz 
>      57955 man7/groff_mm.7.gz
>      59990 man3/HTML::Embperl.3pm.gz
>      63733 man3/Date::Calc.3pm.gz   
>      67045 man1/pcal.1.gz              (pcal)
>      72423 man1/spax.1.gz              (star)
>     194227 man8/backuppc.8.gz          (backuppc)

This is a pretty small number, really. I don't mind a small number of
manual pages needing to change in some small way to be supported
properly.

(I suspect, incidentally, that your code does not check for pages that
include other pages using .so. zshall(1) is my usual test case here.)
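
A .so page is just a one-line include, something like (file name
illustrative):

  .so man1/zshmisc.1

so a detector that only reads the top-level file sees nothing but ASCII.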

> So we can either:
> a) slurp the whole file (up to 585KB, save for wireshark-filter which is a
>    6MB monstrosity)
> b) use an ugly 190KB buffer 
> c) bribe the backuppc maintainer to go down to 71KB
> d) same with pcal and star, for a round number of 64KB

Option d) seems reasonable to me (bearing in mind that we need to
allocate an output buffer for iconv() which in the worst case could need
to be four times the size of the input buffer; my current iconv client
code in whatis just allocates a buffer the same size as the input buffer
and repeatedly reallocs it to double the previous size on E2BIG).
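
For concreteness, that grow-on-E2BIG loop is roughly as follows (a
sketch, with xmalloc/xrealloc standing in for whatever allocation
helpers are to hand; cd is an open iconv_t):

  size_t outsize = inlen;
  char *outbuf = xmalloc (outsize);

  for (;;) {
          char *in = inbuf, *out = outbuf;
          size_t inleft = inlen, outleft = outsize;

          if (iconv (cd, &in, &inleft, &out, &outleft) != (size_t) -1)
                  break;                  /* converted successfully */
          if (errno != E2BIG)
                  break;                  /* EILSEQ/EINVAL: see above */
          iconv (cd, NULL, NULL, NULL, NULL); /* reset to initial state */
          outsize *= 2;
          outbuf = xrealloc (outbuf, outsize);
  }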

> > Is this what your "hack" pipeline implements? If so, I'd love to see it;
> > if not, I'm happy to implement it.
> 
> The prototype is:
>
>   pipeline_command_args (p, "perl", "-CO", "-e",
>                          "use Encode;"
>                          "undef $/;"
>                          "$_ = <STDIN>;"
>                          "eval { print decode('utf-8', $_, 1) };"
>                          "print decode($ARGV[0], $_) if $@;",
>                          page_encoding,
>                          NULL);
> so it's similar.  "Slurp everything into core" in C is a page of code;
> your idea of a static buffer makes it simpler.  And I'm not in a
> position to complain that it's another hack :p

Current man-db makes the buffering pretty trivial:

  const char *buf = pipeline_peek (p, 65536);

I'll try to implement something like this in C, then.
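
The C equivalent of the Perl prototype's "try strict UTF-8, fall back"
logic might look roughly like this (a sketch; try_iconv is a
hypothetical helper, and I'm glossing over how many bytes pipeline_peek
actually returned):

  #include <errno.h>
  #include <iconv.h>

  /* Does buf convert cleanly from 'from'?  The converted output is
   * thrown away; we only care about validity. */
  static int
  try_iconv (const char *buf, size_t len, const char *from)
  {
          iconv_t cd = iconv_open ("UTF-8", from);
          char *in = (char *) buf;
          size_t inleft = len;
          int ok = 1;

          if (cd == (iconv_t) -1)
                  return 0;
          while (inleft) {
                  char out[4096], *outp = out;
                  size_t outleft = sizeof out;

                  if (iconv (cd, &in, &inleft, &outp, &outleft)
                      == (size_t) -1) {
                          if (errno == E2BIG)
                                  continue; /* output full; discard it */
                          ok = (errno == EINVAL); /* cut tail is OK;
                                                     EILSEQ is not */
                          break;
                  }
          }
          iconv_close (cd);
          return ok;
  }

  /* ... so that detection is just: */
  const char *encoding =
          try_iconv (buf, 65536, "UTF-8") ? "UTF-8" : page_encoding;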

> I thought about forking off to avoid a separate binary, but a separate
> binary could be potentially reused by someone else.

A separate binary also makes it possible to copy and paste the
formatting pipeline out of the output of 'man --debug', which is a
useful property I want to preserve.

> For -c, glibc's //TRANSLIT or my translit[1] are always better: they drop
> accents etc., and if they fail to find a valid replacement they will at least
> output "?" instead of silently dropping the character.

Sure, I'm OK with using transliteration instead, though I'd prefer to
stick with what's in the C library (it can always be extended).

Cheers,

-- 
Colin Watson                                       [cjwatson@debian.org]


