Re: Man pages and UTF-8
David Given <dg@cowlark.com> writes:
> Weeeell... unfortunately man-db uses ISO-8859-1 for C and POSIX locales,
> so transcoding would be required.
You do currently get lintian warnings if you try to use ISO 8859-1
characters in man pages. Unfortunately, a lot of people just ignore those
warnings. (One of the reasons for them is precisely to ease this
transition.)
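
As a rough illustration of the kind of check involved, here is a minimal Python sketch that lists the lines of a man page source containing non-ASCII bytes. This is not lintian's actual implementation; the function name `non_ascii_lines` and the sample input are made up for this example.

```python
def non_ascii_lines(source: bytes):
    """Return (line_number, line) pairs for lines with any byte >= 0x80."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(byte >= 0x80 for byte in line):
            hits.append((number, line))
    return hits


# Hypothetical man page fragment with one ISO 8859-1 byte (0xE9, "é").
sample = b'.TH FOO 1\n.SH NAME\nfoo \\- caf\xe9 finder\n'
for number, line in non_ascii_lines(sample):
    print(f"line {number}: {line!r}")
```
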
> Further investigation reveals that man-db seems to transcode UTF-8 to
> ISO-8859-1 before passing it to groff.
Oh, so we lose if we have characters in UTF-8 that can't be represented in
ISO 8859-1. Bleh. That explains why we are where we are.
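
To make the loss concrete, here is a small Python sketch of a UTF-8 to ISO 8859-1 transcode. It only uses Python's built-in codecs and is not man-db's actual code; the helper name `to_latin1` is invented for illustration.

```python
def to_latin1(utf8_bytes: bytes) -> bytes:
    """Transcode UTF-8 input to ISO 8859-1, raising on unrepresentable text."""
    return utf8_bytes.decode("utf-8").encode("iso-8859-1")


# Western European text survives the transcode.
print(to_latin1("naïve café".encode("utf-8")))

# Anything outside ISO 8859-1 (here, Japanese) cannot be transcoded.
try:
    to_latin1("日本語".encode("utf-8"))
except UnicodeEncodeError as error:
    print("lost:", error.object[error.start:error.end])
```
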
Thank you very much for the analysis! It hadn't occurred to me that
man-db would be transcoding things on the way in, and now I understand
much better what's going on.
> It's all a bit of a maze, unfortunately, and I could have misunderstood
> things. But that MULTIBYTE_GROFF #define looks interesting. It *might*
> be possible to crudely hack it to work by using the nippon device and
> the EUC-JP encoding for man pages written in UTF-8. I don't know what
> the coverage of EUC-JP is like compared to UTF-8, but there might be
> mileage there. Alternatively, ascii8 is supposed to be eight-bit clean,
> and might suffice...
I'm pretty sure that the MULTIBYTE_GROFF stuff is what didn't work quite
right and what upstream wasn't entirely happy with. I think it was
developed for some specific Asian encodings and works okay for them, but
possibly not for arbitrary UTF-8. I wonder if that's what Red Hat uses or
if they transcode as well and just lose on man pages that contain
characters outside ISO 8859-1.
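
One rough way to compare coverage is to ask each codec which characters it can encode. The sketch below does that with Python's built-in `euc_jp` and `iso-8859-1` codecs, which may not match exactly what groff's nippon device accepts; the helper `can_encode` and the sample characters are chosen just for illustration.

```python
def can_encode(char: str, encoding: str) -> bool:
    """Report whether a single character is representable in the encoding."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Sample characters: Latin-1 letter, Japanese kana, Korean hangul, Arabic.
samples = ["é", "あ", "한", "ع"]
for char in samples:
    print(
        f"U+{ord(char):04X}",
        "iso-8859-1:", can_encode(char, "iso-8859-1"),
        "euc_jp:", can_encode(char, "euc_jp"),
    )
```

Neither legacy encoding comes close to covering all of Unicode, so either route would drop some UTF-8 input.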
--
Russ Allbery (rra@debian.org) <http://www.eyrie.org/~eagle/>