
Re: Man pages and UTF-8



On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote:
> > Woh, it's great to hear from you.  I'm afraid I've been lazy too, you should
> > be shown ready patches instead of hearing "that's mostly working"...
> 
> If you do work on patches, please make sure they're against current bzr;
> there have been a lot of changes since 2.4.4.

Noted.

> > > I do need to find the stomach to look at upgrading groff again, but it's
> > > not *necessary* (or indeed sufficient) for this. The most important bit
> > > to start with is really the changes to man-db.
> > 
> > We do need to change them both at once.
> 
> No, we don't. Seriously, I understand the problem and it's not
> necessary. man-db can stick iconv pipes in wherever it likes and it's
> all fine. When we upgrade groff at some future point we can just declare
> versioned dependencies or conflicts as necessary, but it is *not*
> necessary for this transition. A basic rule of release management is
> that the more you decouple the easier it will be.

Yet if groff cannot accept any encoding other than ISO-8859-1 (with hacks for
ja/ko/zh), you end up with data loss for anything not representable in 8859-1.

> > The meat of Red Hat changes to groff is:
> > 
> > ISO-8859-1/"nippon" -> LC_CTYPE
> > 
> > and then man-db converts everything into the current locale charset.
> 
> (Point of information: Red Hat doesn't use man-db.)

I didn't look that far; I didn't bother installing a whole Red Hat system,
I just did:

./test-groff -man -Tutf8 <foo.7

which seems to work perfectly.  After extending the upper range from uFFFF
to u10FFFF, the result looks like this: http://angband.pl/deb/man/test.png

> Thus what you're saying seems to be that Red Hat uses the ascii8 device,
> or its equivalent (ascii8 passes through any 8-bit encoding untouched,

Actually, their -Tascii8 is completely broken, they use -Tutf8 instead.

> although certain characters are still reserved for internal use by groff
> which is why it doesn't help with UTF-8). groff upstream has repeatedly
> rejected this as typographically wrong-headed; I don't want to
> perpetuate it. groff is supposed to know what the characters really are,
> not just treat them as binary data.

I fully agree.  The multibyte patch for 1.8 (which Red Hat refers to
everywhere as "the Debian patches") lets groff store characters as Unicode
code points; the input/output issues are what we're trying to fix in this
thread, and properties of particular characters are an orthogonal matter.

> Obviously we have to cope with what we've got, so ascii8 is a necessary
> evil, but it is just plain wrong to use it when we don't have to.

So let's skip it?

> > My own tree instead hardcodes it to UTF-8 under the hood; now it seems
> > to me that it would probably be best to allow groff1.9-ish "-K
> > charset", so man-db would be able to say "-K utf-8" while other users
> > of groff would be unaffected (unlike Red Hat).
> 
> None of this is immediately necessary. Leave groff alone for the moment
> and the problem is simpler. iconv pipes are good enough for the time
> being. When we do something better, it will be a proper upgrade of groff
> converging on real UTF-8 input with proper knowledge of typographical
> meanings of glyphs (as upstream are working on), not this badly-designed
> hodgepodge.

Isn't reading input into a string of Unicode codepoints good enough for now? 
It's a whole world better than operating on opaque binary strings (ascii8),
and works well where RTL or combining chars support is not needed.

> > Yet:
> > [~/man]$ grep ^U mans.enc |wc -l
> > 843
> > [~/man]$ grep ^U mans.enc |grep '\.UTF-8'|wc -l
> > 21
> > 
> > So you would leave that 822 manpages broken.
> 
> If the alternative is breaking the 10522 pages listed in your analysis
> that are ISO-8859-* but not declared as such in their directory name,
> absolutely!

Yeah, breaking those 10522 pages would be outright wrong.  But with a bit of
temporary ugliness in the pipeline we can have both the 10522 in legacy
charsets and the 822 prematurely transitioned ones working.

> > My pipeline is a hack, but it transparently supports every manpage except
> > the several broken ones.  If we could have UTF-8 man in the policy, we would
> > also get a guarantee that no false positive appears in the future.
> 
> So, last night I was thinking about this, and wanted to propose a
> compromise where we recommend in Debian policy that pages be installed
> in a directory that explicitly specifies the encoding (you might not
> like this, but it makes man-db's life a lot easier, it's much easier to
> tell how complete the transition is, and it's what the FHS says we
> should do), but for compatibility with the RPM world we transparently
> accept UTF-8 manual pages installed in /usr/share/man/$LL/ anyway.

So you would want to have the old ones put into /usr/share/man/ISO-8859-1/
(or man.8859_1) instead of /usr/share/man/?  That would work, too.

I'm opposed to spelling /usr/share/man/UTF-8/ out in full on aesthetic
grounds, as the whole point of Unicode is to let us forget that something
called a "charset" ever needed to be set, but it's your decision here after
all.

> I do have an efficiency concern as man-db upstream, though, which is why
> I hadn't just implemented this in the obvious crude way (try iconv -f
> UTF-8, throw away the pipeline on error, try again).

Yeah, doing the whole pipeline twice would be horrendous.

> For large manual pages it's still of practical importance that the
> formatting pipeline be smooth; that is, I don't want to have to scan the
> whole page looking for non-UTF-8 characters before I can pass it to groff.
> My ideal implementation would involve a program, let's call it "manconv",
> with behaviour much like the following:
> 
>   * Reads from standard input and writes to standard output.
> 
>   * Valid options are -f ENCODING[:ENCODING...], -t ENCODING, and -c;
>     these are interpreted as with iconv except that -f's argument is a
>     colon-separated list of encodings to try, typically something like
>     UTF-8:ISO-8859-1. Fallback is only possible if characters can be
>     expected to be invalid in leading encodings.
> 
>   * The implementation would use iconv() on reasonably-sized chunks of
>     data (let's say 4KB). If it encounters EILSEQ or EINVAL, it will
>     throw away the current output buffer, fall back to the next encoding
>     in the list, and attempt to convert the same input buffer again.

EINVAL is possible only when a multibyte sequence is cut off by the end of
the buffer, so that's fine.
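
Something like the sketch below, perhaps; this is rough, untested C with the
encoding list hard-coded to UTF-8 then ISO-8859-1 purely for illustration
(the name, sizes and all other details are made up, so don't read it as the
real manconv):

/* manconv-sketch.c: chunked iconv(3) with fallback, as described above.
 * The output buffer is 4x the input chunk, so E2BIG cannot happen when
 * going from ISO-8859-1 to UTF-8 (at most 2 output bytes per input byte). */
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

int main (void)
{
    const char *encodings[] = { "UTF-8", "ISO-8859-1" };
    size_t enc = 0, nenc = sizeof encodings / sizeof *encodings;
    iconv_t cd = iconv_open ("UTF-8", encodings[enc]);
    char inbuf[CHUNK], outbuf[4 * CHUNK];
    size_t inlen = 0;

    if (cd == (iconv_t) -1) {
        perror ("iconv_open");
        return 1;
    }

    for (;;) {
        size_t got = fread (inbuf + inlen, 1, sizeof inbuf - inlen, stdin);
        int at_eof = (got == 0);

        inlen += got;
        if (!inlen)
            break;              /* clean EOF, no leftover bytes */

        for (;;) {
            char *in = inbuf, *out = outbuf;
            size_t inleft = inlen, outleft = sizeof outbuf;
            size_t ret = iconv (cd, &in, &inleft, &out, &outleft);

            if (ret == (size_t) -1 && errno == EILSEQ && enc + 1 < nenc) {
                /* Invalid in the current encoding: throw away this
                 * chunk's output and redo the same input with the next
                 * encoding in the list.  Earlier chunks stay as emitted. */
                iconv_close (cd);
                cd = iconv_open ("UTF-8", encodings[++enc]);
                continue;
            }

            /* Emit whatever converted cleanly. */
            fwrite (outbuf, 1, sizeof outbuf - outleft, stdout);

            if (ret == (size_t) -1 && errno == EINVAL && !at_eof) {
                /* A multibyte sequence cut off by the end of the chunk
                 * (the EINVAL case above): keep the tail for the next
                 * read. */
                memmove (inbuf, in, inleft);
                inlen = inleft;
            } else {
                /* Done with this chunk.  Bytes invalid even in the last
                 * encoding get silently dropped here, which a real
                 * implementation would want to handle better. */
                inlen = 0;
            }
            break;
        }

        if (at_eof)
            break;
    }

    iconv_close (cd);
    return 0;
}

It would slot into the pipeline as, say, "zcat foo.1.gz | manconv-sketch |
...", and only the first chunk where the encodings actually disagree matters
for detection, which is exactly where your 4KB window comes in (see below).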

> This would have the behaviour that output is issued smoothly, and for -f
> UTF-8:* the encoding is detected correctly provided that there is a
> non-UTF-8 character within the first 4KB of the file. I haven't tested
> this, but intuitively it seems that it should be a good compromise.

Bad news: 4KB is not enough.  Often the only 8-bit characters are a
copyright sign (©) or a name in the authors list.  The first offending
characters are at these uncompressed offsets:

     33219 man3/Mail::Message::Field.3pm.gz
     33226 man1/full_index.1grass.gz
     36027 man1/mined.1.gz
     37172 man3/Date::Pcalc.3pm.gz
     39127 man1/SWISH-FAQ.1.gz
     40214 man3/Event.3pm.gz  
     41114 man3/Class::Std.3pm.gz
     42997 man3/SoQtViewer.3.gz  
     47367 man3/Net::SSLeay.3pm.gz
     53003 man1/SWISH-CONFIG.1.gz 
     57955 man7/groff_mm.7.gz
     59990 man3/HTML::Embperl.3pm.gz
     63733 man3/Date::Calc.3pm.gz   
     67045 man1/pcal.1.gz              (pcal)
     72423 man1/spax.1.gz              (star)
    194227 man8/backuppc.8.gz          (backuppc)

So we can either:
a) slurp the whole file (up to 585KB, save for wireshark-filter which is a
   6MB monstrosity)
b) use an ugly 190KB buffer 
c) bribe the backuppc maintainer to go down to 71KB
d) same with pcal and star, for a round number of 64KB

> Is this what your "hack" pipeline implements? If so, I'd love to see it;
> if not, I'm happy to implement it.

The prototype is:

    /* Run perl with UTF-8 STDOUT (-CO): slurp the whole page, try to
     * decode it as UTF-8 with CHECK=1 (croak on invalid input), and if
     * that dies fall back to the page's declared encoding. */
    pipeline_command_args (p, "perl", "-CO", "-e",
                           "use Encode;"
                           "undef $/;"
                           "$_=<STDIN>;"
                           "eval{print decode('utf-8',$_,1)};"
                           "print decode($ARGV[0],$_) if $@",
                           page_encoding,
                           NULL);
so it's similar.  "Slurp everything into core" in C is a page of code; your
idea of a static buffer makes it simpler, and I'm in no position to complain
that it's another hack :p

I thought about forking off to avoid a separate binary, but a separate
binary could potentially be reused by someone else.

For -c, glibc's //TRANSLIT or my translit[1] are always better: they drop
accents and the like, and if they fail to find a valid replacement they at
least output "?" instead of silently dropping the character.

[1] http://angband.pl/svn/kbtin/trunk/translit.h; unlike glibc, it
intentionally doesn't do æ->ae, which is worse for flowing text but won't
break pre-formatted or character-cell text.  Both glibc's and mine are very
poor substitutes for what Links can do: Links can even turn "Дебян" or
"Δεβιαν" into "Debian".  But that's probably overkill here...

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.


