Re: Man pages and UTF-8

To: debian-mentors@lists.debian.org
Subject: Re: Man pages and UTF-8
From: David Given <dg@cowlark.com>
Date: Mon, 13 Aug 2007 22:13:29 +0100
Message-id: <46C0C979.20609@cowlark.com>
In-reply-to: <871we7e3p7.fsf@windlord.stanford.edu>
References: <46BC35D3.2000302@cowlark.com> <87mywz4s5c.fsf@benfinney.id.au> <46BC3CC8.1000004@cowlark.com> <20070810112302.GA26779@angband.pl> <20070812015034.GA9322@debian.org> <87eji95akd.fsf@benfinney.id.au> <20070812111224.GB4617@debian.org> <20070812225157.GA17113@angband.pl> <87odhcm73m.fsf@windlord.stanford.edu> <20070813120752.GA17168@angband.pl> <871we7e3p7.fsf@windlord.stanford.edu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Russ Allbery wrote:
[...]
> Okay, your analysis matches what I thought was going on.  However, David
> Given seems to be seeing something else where some man pages are already
> encoded in UTF-8.  So I guess I'm confused as to what's going on and what
> the current status is.

I've only got a handful of them. Here's one:

vim-common: /usr/share/man/it.UTF-8/man1/rvim.1.gz

That's vim-common 1:7.0-122+1etch2.

Here's the relevant comment from the source of man-db:

/* Due to historical limitations in groff (which may be removed in the
 * future), there is no mechanism for a man page to specify its own
 * encoding. This means that each national language directory needs to carry
 * with it information about its encoding, and each groff device needs to
 * have a default encoding associated with it. Out of the box, groff
 * formally allows only ISO-8859-1 on input; however, patches originating
 * with Debian and imported by many other GNU/Linux distributions change
 * this somewhat.
 *
 * Eventually, groff will support proper Unicode input, and much of this
 * horror can go away.
 *
 * Do *not* confuse source encoding with groff encoding. The encoding
 * specified in this table is the encoding in which the source man pages in
 * each language directory are expected to be written. The groff encoding is
 * determined by the selected groff device and sometimes also by the user's
 * locale.
 *
 * The standard output encoding is the encoding assumed for cat pages for
 * each language directory. It must *not* be used to discover the actual
 * output encoding displayed to the user; that is determined by the locale.
 * TODO: it would be useful to be able to change the standard output
 * encoding in the configuration file.
 *
 * This table is expected to change over time, particularly as man pages
 * begin to move towards UTF-8. Feel free to patch this for your
 * distribution; send me updates for languages I've missed.
 *
 * Explicit encodings in the directory name (e.g. de_DE.UTF-8) override this
 * table.
 */

(man-db-2.4.3/src/encodings.c)
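
To make the last point of that comment concrete, here is a minimal sketch of how an explicit encoding could be pulled out of a directory name such as it.UTF-8. This is not man-db's code: the function name explicit_dir_encoding and its interface are made up for illustration, and the handling of an "@modifier" suffix is my assumption about locale-style names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT taken from man-db: copy the explicit encoding
 * part of a language directory name ("it.UTF-8" -> "UTF-8") into buf.
 * Returns 1 if the name carries an explicit encoding, 0 otherwise, in
 * which case the caller would fall back to the per-language table. */
int explicit_dir_encoding(const char *dir, char *buf, size_t buflen)
{
    const char *dot = strchr(dir, '.');
    size_t len;

    if (dot == NULL)
        return 0;                /* plain "it": no explicit encoding */
    dot++;                       /* skip the '.' itself */

    /* Assumption: an "@modifier" suffix, if any, is not part of the
     * encoding name, so stop there. */
    len = strcspn(dot, "@");
    if (len == 0 || len >= buflen)
        return 0;                /* empty encoding, or buffer too small */

    memcpy(buf, dot, len);
    buf[len] = '\0';
    return 1;
}
```

With this, the rvim.1.gz page above (in it.UTF-8) would be treated as UTF-8 regardless of what the table says for "it".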

> If our groff really can handle UTF-8 input and is doing so for some
> locales, I'd love to declare all regular man pages are in UTF-8 and be
> done with it; that's a change that we can probably make without backward
> compatibility issues right now, since currently those code points are
> disallowed.

Weeeell... unfortunately man-db uses ISO-8859-1 for C and POSIX locales, so
transcoding would be required.

Further investigation reveals that man-db seems to transcode UTF-8 to
ISO-8859-1 before passing it to groff. man-db has three tables. This one tells
it what encoding to use for each locale:

{ "C",          "ISO-8859-1",   "ANSI_X3.4-1968"        }, /* English */
{ "POSIX",      "ISO-8859-1",   "ANSI_X3.4-1968"        }, /* English */
#ifdef MULTIBYTE_GROFF
/* These languages require a patched version of groff with the
 * ascii8 and nippon devices.
 */
{ "ja",         "EUC-JP",       "EUC-JP"                }, /* Japanese */
{ "ko",         "EUC-KR",       "EUC-KR"                }, /* Korean */
...

After the locale name, the two columns seem to be: the encoding the man page
is written in, and the encoding to use when saving the cat page. This one
tells it what output device to use:

        { "ANSI_X3.4-1968",     "ascii"         },
        { "ISO-8859-1",         "latin1"        },
        { "ISO-8859-15",        "latin1"        },
        { "UTF-8",              "utf8"          },
#ifdef MULTIBYTE_GROFF
        { "EUC-JP",             "nippon"        },
#endif /* MULTIBYTE_GROFF */

And this one tells it what encoding to pass in to each groff device:

        { "ascii",      "ISO-8859-1",   "ANSI_X3.4-1968"        },
        { "latin1",     "ISO-8859-1",   "ISO-8859-1"            },
        { "utf8",       "ISO-8859-1",   "UTF-8"                 },
#ifdef MULTIBYTE_GROFF
        { "ascii8",     NULL,           NULL                    },
        { "nippon",     "EUC-JP",       "EUC-JP"                },

(After the device name, the columns are: encoding to pass into groff, encoding
passed out of groff.)

Note that if utf8 is selected as the output device, which appears to happen if
the source encoding is UTF-8, the groff source encoding is specified as
ISO-8859-1 and a transcode happens.
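
As a sanity check on my reading, here is a minimal sketch of that lookup chain. It uses only the non-MULTIBYTE_GROFF rows quoted above; the struct names, function names and the needs_transcode() rule are my own invention for illustration, and man-db's real control flow may well differ (for example, the device may be chosen from the locale rather than directly from the source encoding).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types; only the table rows come from encodings.c. */
struct device_entry { const char *encoding; const char *device; };
struct device_io    { const char *device; const char *input; const char *output; };

static const struct device_entry device_table[] = {
    { "ANSI_X3.4-1968", "ascii"  },
    { "ISO-8859-1",     "latin1" },
    { "ISO-8859-15",    "latin1" },
    { "UTF-8",          "utf8"   },
    { NULL,             NULL     }
};

static const struct device_io io_table[] = {
    { "ascii",  "ISO-8859-1", "ANSI_X3.4-1968" },
    { "latin1", "ISO-8859-1", "ISO-8859-1"     },
    { "utf8",   "ISO-8859-1", "UTF-8"          },
    { NULL,     NULL,         NULL             }
};

/* Which groff device goes with this encoding? NULL if unknown. */
const char *device_for_encoding(const char *encoding)
{
    size_t i;
    for (i = 0; device_table[i].encoding != NULL; i++)
        if (strcmp(device_table[i].encoding, encoding) == 0)
            return device_table[i].device;
    return NULL;
}

/* Which encoding does groff expect on input for this device? */
const char *groff_input_for_device(const char *device)
{
    size_t i;
    for (i = 0; io_table[i].device != NULL; i++)
        if (strcmp(io_table[i].device, device) == 0)
            return io_table[i].input;
    return NULL;
}

/* Assumed rule: a transcode is needed whenever the page's source
 * encoding differs from what the selected device wants on input. */
int needs_transcode(const char *source_encoding)
{
    const char *device = device_for_encoding(source_encoding);
    const char *input = device ? groff_input_for_device(device) : NULL;
    return input != NULL && strcmp(source_encoding, input) != 0;
}
```

Under these assumptions, a UTF-8 page selects the utf8 device, whose input encoding is ISO-8859-1, so needs_transcode("UTF-8") is true, while an ISO-8859-1 page goes through latin1 untouched.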

It's all a bit of a maze, unfortunately, and I could have misunderstood
things. But that MULTIBYTE_GROFF #define looks interesting. It *might* be
possible to crudely hack it to work by using the nippon device and the EUC-JP
encoding for man pages written in UTF-8. I don't know what the coverage of
EUC-JP is like compared to UTF-8, but there might be mileage there.
Alternatively, ascii8 is supposed to be eight-bit clean, and might suffice...

- --
┌── ｄｇ＠ｃｏｗｌａｒｋ．ｃｏｍ ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGwMl4f9E0noFvlzgRAp5TAKC3gWIPYf7lUBcguf7HySWkzZk5WwCgw4I3
WPtVKwn8MquypQdtbPkl+z8=
=F9pn
-----END PGP SIGNATURE-----
