Re: Man pages and UTF-8

To: debian-mentors@lists.debian.org
Subject: Re: Man pages and UTF-8
From: Russ Allbery <rra@debian.org>
Date: Mon, 13 Aug 2007 15:16:14 -0700
Message-id: <87ps1r9j2p.fsf@windlord.stanford.edu>
In-reply-to: <46C0C979.20609@cowlark.com> (David Given's message of "Mon, 13 Aug 2007 22:13:29 +0100")
References: <46BC35D3.2000302@cowlark.com> <87mywz4s5c.fsf@benfinney.id.au> <46BC3CC8.1000004@cowlark.com> <20070810112302.GA26779@angband.pl> <20070812015034.GA9322@debian.org> <87eji95akd.fsf@benfinney.id.au> <20070812111224.GB4617@debian.org> <20070812225157.GA17113@angband.pl> <87odhcm73m.fsf@windlord.stanford.edu> <20070813120752.GA17168@angband.pl> <871we7e3p7.fsf@windlord.stanford.edu> <46C0C979.20609@cowlark.com>

David Given <dg@cowlark.com> writes:

> Weeeell... unfortunately man-db uses ISO-8859-1 for C and POSIX locales,
> so transcoding would be required.

You do get lintian warnings if you try to use ISO 8859-1 characters in man
pages currently.  Unfortunately, a lot of people just ignore those
warnings.  (One of the reasons for them is precisely to ease this
transition.)
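
To make the idea concrete, here is a rough Python sketch of the kind of
check that flags raw 8-bit characters in a man page source.  This is not
lintian's actual implementation; the helper name find_non_ascii and the
sample input are made up for illustration.

```python
def find_non_ascii(source: bytes):
    """Return (line_number, column, byte_value) for every non-ASCII byte.

    Hypothetical helper for illustration only; not how lintian does it.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Iterating over a bytes object yields integers in 0..255.
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((lineno, col, byte))
    return hits


# A roff fragment with a raw ISO 8859-1 e-acute (0xE9) on line 2.
sample = b".TH EXAMPLE 1\nCaf\xe9 society\n"
hits = find_non_ascii(sample)
for lineno, col, byte in hits:
    print(f"line {lineno}, column {col}: byte 0x{byte:02X}")
```

A page that spells the same character with a groff escape such as \['e]
stays pure ASCII and would pass a check like this.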

> Further investigation reveals that man-db seems to transcode UTF-8 to
> ISO-8859-1 before passing it to groff.

Oh, so we lose if we have characters in UTF-8 that can't be represented in
ISO 8859-1.  Bleh.  That explains why we are where we are.
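
A small Python sketch of why that transcoding step is lossy.  It only
illustrates the general UTF-8 to ISO 8859-1 problem; it is not man-db's
code, and the function names and sample text are assumptions of mine.

```python
def transcode_to_latin1(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 and re-encode as ISO 8859-1.

    Characters with no ISO 8859-1 equivalent are replaced with '?'.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="replace")


def lost_characters(utf8_bytes: bytes):
    """List the characters that ISO 8859-1 cannot represent.

    ISO 8859-1 maps exactly onto code points U+0000 through U+00FF.
    """
    text = utf8_bytes.decode("utf-8")
    return [ch for ch in text if ord(ch) > 0xFF]


# i-diaeresis survives; Polish l-stroke and the three CJK characters do not.
page = "na\u00efve \u0142 \u65e5\u672c\u8a9e".encode("utf-8")
latin1 = transcode_to_latin1(page)
lost = lost_characters(page)
print(latin1)
print(lost)
```

Anything outside U+0000 through U+00FF is gone by the time the formatter
sees the text, whatever the formatter itself could have handled.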

Thank you very much for the analysis!  It hadn't occurred to me that
man-db would be transcoding things on the way in, and now I understand
much better what's going on.

> It's all a bit of a maze, unfortunately, and I could have misunderstood
> things. But that MULTIBYTE_GROFF #define looks interesting. It *might*
> be possible to crudely hack it to work by using the nippon device and
> the EUC-JP encoding for man pages written in UTF-8. I don't know what
> the coverage of EUC-JP is like compared to UTF-8, but there might be
> mileage there.  Alternatively, ascii8 is supposed to be eight-bit clean,
> and might suffice...

I'm pretty sure that the MULTIBYTE_GROFF stuff is what didn't work quite
right and what upstream wasn't entirely happy with.  I think it was
developed for some specific Asian encodings and works okay for them, but
possibly not for arbitrary UTF-8.  I wonder if that's what Red Hat uses or
if they transcode as well and just lose on man pages that contain
non-European characters.
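
On the coverage question, one cheap way to get a feel for it is to ask
which sample characters each candidate encoding can represent.  The
sketch below uses Python's own codecs as a stand-in, so it says nothing
about what groff's nippon or ascii8 devices actually do; the helper
names and the sample characters are assumptions of mine.

```python
def encodable(ch: str, encoding: str) -> bool:
    """Return True if the character can be encoded in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage(chars: str, encodings):
    """Map each encoding name to the sample characters it can represent."""
    return {enc: [ch for ch in chars if encodable(ch, enc)] for enc in encodings}


# Latin capital A, e-acute, a CJK ideograph (U+65E5), Arabic letter ain.
samples = "A\u00e9\u65e5\u0639"
table = coverage(samples, ["iso-8859-1", "euc_jp", "utf-8"])
for enc, chars in table.items():
    print(enc, [f"U+{ord(ch):04X}" for ch in chars])
```

Any fixed legacy encoding covers only part of what UTF-8 can carry, so
pages in scripts outside that part would still lose characters.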

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
