Re: Man pages and UTF-8
Russ Allbery wrote:
[...]
> Okay, your analysis matches what I thought was going on. However, David
> Given seems to be seeing something else where some man pages are already
> encoded in UTF-8. So I guess I'm confused as to what's going on and what
> the current status is.
I've only got a handful of them. Here's one:
vim-common: /usr/share/man/it.UTF-8/man1/rvim.1.gz
That's vim-common 1:7.0-122+1etch2.
Here's the relevant comment from the source of man-db:
/* Due to historical limitations in groff (which may be removed in the
 * future), there is no mechanism for a man page to specify its own
 * encoding. This means that each national language directory needs to carry
 * with it information about its encoding, and each groff device needs to
 * have a default encoding associated with it. Out of the box, groff
 * formally allows only ISO-8859-1 on input; however, patches originating
 * with Debian and imported by many other GNU/Linux distributions change
 * this somewhat.
 *
 * Eventually, groff will support proper Unicode input, and much of this
 * horror can go away.
 *
 * Do *not* confuse source encoding with groff encoding. The encoding
 * specified in this table is the encoding in which the source man pages in
 * each language directory are expected to be written. The groff encoding is
 * determined by the selected groff device and sometimes also by the user's
 * locale.
 *
 * The standard output encoding is the encoding assumed for cat pages for
 * each language directory. It must *not* be used to discover the actual
 * output encoding displayed to the user; that is determined by the locale.
 * TODO: it would be useful to be able to change the standard output
 * encoding in the configuration file.
 *
 * This table is expected to change over time, particularly as man pages
 * begin to move towards UTF-8. Feel free to patch this for your
 * distribution; send me updates for languages I've missed.
 *
 * Explicit encodings in the directory name (e.g. de_DE.UTF-8) override this
 * table.
 */
(man-db-2.4.3/src/encodings.c)
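To make the last paragraph of that comment concrete, here is a rough sketch of how a directory name such as it.UTF-8 could override the table. This is my own illustration, not man-db's code: dir_source_encoding and lang_table are hypothetical names, the table holds only rows quoted in this thread, and I'm assuming an exact match on the directory name and ignoring modifiers like @euro.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-language table; only rows that are
 * quoted in this thread are listed. */
struct lang_entry {
    const char *lang;
    const char *source_encoding;
};

static const struct lang_entry lang_table[] = {
    { "C",     "ISO-8859-1" },
    { "POSIX", "ISO-8859-1" },
    { "ja",    "EUC-JP" },
    { "ko",    "EUC-KR" },
    { NULL,    NULL }
};

/* Return the source encoding for a language directory name such as
 * "it.UTF-8" or "ja". An explicit ".ENCODING" suffix wins; otherwise the
 * table is consulted; NULL means the sketch has no answer. */
const char *dir_source_encoding(const char *dir)
{
    const char *dot = strchr(dir, '.');
    const struct lang_entry *e;

    if (dot != NULL && dot[1] != '\0')
        return dot + 1;                 /* explicit encoding in the name */

    for (e = lang_table; e->lang != NULL; e++)
        if (strcmp(e->lang, dir) == 0)
            return e->source_encoding;  /* fall back to the table */

    return NULL;
}
```

With this, dir_source_encoding("it.UTF-8") yields "UTF-8" without the table being looked at, which is consistent with the rvim.1.gz page above living in an it.UTF-8 directory.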
> If our groff really can handle UTF-8 input and is doing so for some
> locales, I'd love to declare all regular man pages are in UTF-8 and be
> done with it; that's a change that we can probably make without backward
> compatibility issues right now, since currently those code points are
> disallowed.
Weeeell... unfortunately man-db uses ISO-8859-1 for C and POSIX locales, so
transcoding would be required.
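To show what such a transcode involves, and why it is lossy, here is a minimal hand-rolled sketch of UTF-8 to ISO-8859-1 conversion. It is not how man-db does it; utf8_to_latin1 is a hypothetical helper, and a real tool would more likely go through iconv. The point is just that only code points up to U+00FF survive.

```c
#include <stddef.h>

/* Transcode in_len bytes of UTF-8 into ISO-8859-1.
 * out must have room for at least in_len bytes.
 * Returns the number of bytes written, or (size_t)-1 if the input is
 * malformed or contains a code point above U+00FF, which ISO-8859-1
 * cannot represent. */
size_t utf8_to_latin1(const unsigned char *in, size_t in_len,
                      unsigned char *out)
{
    size_t i = 0, n = 0;

    while (i < in_len) {
        unsigned char b = in[i];

        if (b < 0x80) {
            /* Plain ASCII: copy through unchanged. */
            out[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < in_len &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            out[n++] = (unsigned char)(((b & 0x1F) << 6) |
                                       (in[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Anything else is either malformed or beyond Latin-1. */
            return (size_t)-1;
        }
    }
    return n;
}
```

So "café" in UTF-8 (five bytes) comes out as four bytes ending in 0xE9, while a euro sign (0xE2 0x82 0xAC) makes the function give up.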
Further investigation reveals that man-db seems to transcode UTF-8 to
ISO-8859-1 before passing it to groff. man-db has three tables. This one tells
it what encoding to use for each locale:
{ "C", "ISO-8859-1", "ANSI_X3.4-1968" }, /* English */
{ "POSIX", "ISO-8859-1", "ANSI_X3.4-1968" }, /* English */
#ifdef MULTIBYTE_GROFF
/* These languages require a patched version of groff with the
* ascii8 and nippon devices.
*/
{ "ja", "EUC-JP", "EUC-JP" }, /* Japanese */
{ "ko", "EUC-KR", "EUC-KR" }, /* Korean */
...
The two encoding columns (after the language name) seem to be: the encoding the
man page is written in, and the encoding to use when saving the cat page. This
one tells it what output device to use:
{ "ANSI_X3.4-1968", "ascii" },
{ "ISO-8859-1", "latin1" },
{ "ISO-8859-15", "latin1" },
{ "UTF-8", "utf8" },
#ifdef MULTIBYTE_GROFF
{ "EUC-JP", "nippon" },
#endif /* MULTIBYTE_GROFF */
And this one tells it what encoding to pass in to each groff device:
{ "ascii", "ISO-8859-1", "ANSI_X3.4-1968" },
{ "latin1", "ISO-8859-1", "ISO-8859-1" },
{ "utf8", "ISO-8859-1", "UTF-8" },
#ifdef MULTIBYTE_GROFF
{ "ascii8", NULL, NULL },
{ "nippon", "EUC-JP", "EUC-JP" },
(Columns after the device name are: encoding to pass into groff, encoding
passed out of groff.)
Note that if utf8 is selected as the output device, which appears to happen if
the source encoding is UTF-8, the groff source encoding is specified as
ISO-8859-1 and a transcode happens.
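Chaining the last two tables together, my reading can be sketched like this. It is a simplification under my assumption that the device is chosen from the source encoding; the struct and function names (charset_device, device_encoding, device_for_charset, groff_input_for_device, needs_transcode) are made up for illustration, and only the non-MULTIBYTE_GROFF rows quoted above are included.

```c
#include <stddef.h>
#include <string.h>

struct charset_device {
    const char *charset;
    const char *device;
};

struct device_encoding {
    const char *device;
    const char *groff_in;   /* encoding passed into groff */
    const char *groff_out;  /* encoding coming out of groff */
};

/* Rows copied from the excerpts above. */
static const struct charset_device charset_to_device[] = {
    { "ANSI_X3.4-1968", "ascii" },
    { "ISO-8859-1",     "latin1" },
    { "ISO-8859-15",    "latin1" },
    { "UTF-8",          "utf8" },
    { NULL,             NULL }
};

static const struct device_encoding device_to_encoding[] = {
    { "ascii",  "ISO-8859-1", "ANSI_X3.4-1968" },
    { "latin1", "ISO-8859-1", "ISO-8859-1" },
    { "utf8",   "ISO-8859-1", "UTF-8" },
    { NULL,     NULL,         NULL }
};

/* Look up the groff device for a charset; NULL if not listed. */
const char *device_for_charset(const char *charset)
{
    const struct charset_device *e;

    for (e = charset_to_device; e->charset != NULL; e++)
        if (strcmp(e->charset, charset) == 0)
            return e->device;
    return NULL;
}

/* Look up the encoding groff expects on input for a device. */
const char *groff_input_for_device(const char *device)
{
    const struct device_encoding *e;

    for (e = device_to_encoding; e->device != NULL; e++)
        if (strcmp(e->device, device) == 0)
            return e->groff_in;
    return NULL;
}

/* Return 1 if a page in source_encoding has to be transcoded before
 * groff sees it, 0 if it can be passed through, -1 if unknown. */
int needs_transcode(const char *source_encoding)
{
    const char *device = device_for_charset(source_encoding);
    const char *groff_in;

    if (device == NULL)
        return -1;
    groff_in = groff_input_for_device(device);
    if (groff_in == NULL)
        return -1;
    return strcmp(source_encoding, groff_in) != 0;
}
```

For a UTF-8 source page this picks the utf8 device, finds ISO-8859-1 as the groff input encoding, and so reports that a transcode is needed; for an ISO-8859-1 page it reports that none is.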
It's all a bit of a maze, unfortunately, and I could have misunderstood
things. But that MULTIBYTE_GROFF #define looks interesting. It *might* be
possible to crudely hack it to work by using the nippon device and the EUC-JP
encoding for man pages written in UTF-8. I don't know what the coverage of
EUC-JP is like compared to UTF-8, but there might be mileage there.
Alternatively, ascii8 is supposed to be eight-bit clean, and might suffice...
--
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom