
Re: Man pages and UTF-8

David Given <dg@cowlark.com> writes:

> Weeeell... unfortunately man-db uses ISO-8859-1 for C and POSIX locales,
> so transcoding would be required.

You do get lintian warnings if you try to use ISO 8859-1 characters in man
pages currently.  Unfortunately, a lot of people just ignore those
warnings.  (One of the reasons for them is precisely to ease this

> Further investigation reveals that man-db seems to transcode UTF-8 to
> ISO-8859-1 before passing it to groff.

Oh, so we lose if we have characters in UTF-8 that can't be represented in
ISO 8859-1.  Bleh.  That explains why we are where we are.
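
To make the loss concrete, here is a minimal sketch in Python (not anything
man-db itself does; the function name and the sample string are made up for
illustration).  It lists the characters in a string that have no ISO 8859-1
representation, which are the ones a UTF-8 to ISO 8859-1 transcoding step
could not carry through:

```python
def unrepresentable_in_latin1(text):
    """Return the characters of text that cannot be encoded as ISO 8859-1.

    Hypothetical helper for illustration only; man-db does its
    transcoding in C and is not modelled here.
    """
    lost = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            # No ISO 8859-1 code point for this character.
            lost.append(ch)
    return lost


if __name__ == "__main__":
    # Made-up sample: accented Latin text followed by Japanese text.
    sample = "caf\u00e9 \u65e5\u672c\u8a9e"
    print(unrepresentable_in_latin1(sample))
```

Accented Western European letters survive, while anything outside the
ISO 8859-1 repertoire (here, the Japanese characters) is reported as lost.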

Thank you very much for the analysis!  It hadn't occurred to me that
man-db would be transcoding things on the way in, and now I understand
much better what's going on.

> It's all a bit of a maze, unfortunately, and I could have misunderstood
> things. But that MULTIBYTE_GROFF #define looks interesting. It *might*
> be possible to crudely hack it to work by using the nippon device and
> the EUC-JP encoding for man pages written in UTF-8. I don't know what
> the coverage of EUC-JP is like compared to UTF-8, but there might be
> mileage there.  Alternatively, ascii8 is supposed to be eight-bit clean,
> and might suffice...

I'm pretty sure that the MULTIBYTE_GROFF stuff is what didn't work quite
right and what upstream wasn't entirely happy with.  I think it was
developed for some specific Asian encodings and works okay for them, but
possibly not for arbitrary UTF-8.  I wonder if that's what Red Hat uses or
if they transcode as well and just lose on man pages that contain
non-European characters.
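
On the question of EUC-JP coverage compared to UTF-8, a rough way to probe
it is to ask a codec library which strings it can encode.  The sketch below
uses Python's built-in euc_jp codec as a stand-in; that is an assumption,
since groff's nippon device may not accept exactly the same repertoire, and
the helper name and sample strings are made up for illustration:

```python
def encodable(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Made-up samples in three scripts.
    samples = {
        "Japanese": "\u65e5\u672c\u8a9e",
        "Hebrew": "\u05e9\u05dc\u05d5\u05dd",
        "Thai": "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35",
    }
    for name, text in samples.items():
        # Columns: script name, fits in EUC-JP?, fits in UTF-8?
        print(name, encodable(text, "euc_jp"), encodable(text, "utf-8"))
```

UTF-8 can encode every sample, while EUC-JP is limited to the characters in
the JIS sets it wraps, so scripts such as Hebrew or Thai do not fit.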

Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
