Re: Man pages and UTF-8
-----BEGIN PGP SIGNED MESSAGE-----
Adam Borowski wrote:
> Due to Red Hat and probably other dists using UTF-8 already, plenty of man
> pages are in UTF-8 when our groff still can't parse them. Having gone
> through 2/3 of the archive, I got 807 such pages so far. And every single
> one displays lovely "Ã¤" or similar instead. That's 9% of all mans with
> non-ASCII characters in the corpus.
By that, you mean they're encoded as UTF-8 where man-db expects whatever
encoding its hard-coded table lists for them, correct? How are you detecting them?
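
For what it's worth, the "Ã¤" symptom is exactly what you get when UTF-8 bytes
are read as ISO-8859-1. Here is a rough Python sketch of that, plus one naive
way of spotting such pages. The function names are made up for illustration,
and I'm not claiming this is how you or man-db actually detect anything:

```python
# Illustrative sketch only; mojibake() and looks_like_utf8() are hypothetical names.

def mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: non-ASCII bytes are present and the whole buffer is valid UTF-8."""
    if data.isascii():
        return False  # pure ASCII says nothing about the intended encoding
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(mojibake("ä"))
```

The heuristic leans on the fact that ordinary ISO-8859-1 text is rarely valid
UTF-8 by accident, but it can still be fooled, so treat a hit as "probably".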
>> UTF-8 is supported on output, so it is really transparent for users.
> If you consider having all unsupported characters silently dropped as being
This may not be as bad as all that, actually.
Currently man-db will cope fine with UTF-8 man pages (if it's expecting them)
and will output UTF-8. Of course, it'll lose all characters not in ISO-8859-1,
but that's a man-db bug.
So, assuming the pages we believe are ISO-8859-1 actually *are* in ISO-8859-1,
we should be able to transcode all of them to UTF-8, update man-db's table so
it expects them, and not lose any functionality. In other words, without
having to wait for the technology, we can do this:
- transcode all man pages currently in ISO-8859-1 into UTF-8
- move all non-ISO-8859-1 man pages into directories with explicit encodings
et voila (which will soon be able to be reliably spelt voilà), we have now
achieved total UTF-8 dominance. Admittedly, because we're not handling
non-ISO-8859-1 characters, it's mere buzzword compliance, but that is now a
perfectly manageable bug in man-db and groff. It means that by making one
small change to man-db we can start the policy change and the technology
change *in parallel*, which ought to save loads of time.
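
The transcoding step itself is tiny. A minimal sketch in Python, assuming
uncompressed page files that really are ISO-8859-1 (the function names are
hypothetical, and installed pages are often gzipped, which this ignores):

```python
# Illustrative sketch only; not part of man-db or any packaging tool.
from pathlib import Path


def transcode_latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def transcode_page(path: Path) -> bool:
    """Rewrite one page in place; return True if the file was changed."""
    data = path.read_bytes()
    if data.isascii():
        return False  # ASCII is already valid UTF-8, nothing to do
    path.write_bytes(transcode_latin1_to_utf8(data))
    return True
```

Note that decoding as ISO-8859-1 never fails, since every byte value maps to
some character, so this cannot tell you a page was mislabelled. Running it on
a page that is already UTF-8 produces exactly the double-encoded mess quoted
above, so it must be applied once, and only to pages known to be ISO-8859-1.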
...also, because man pages are now either in UTF-8 or in a directory with an
explicit encoding in the name, it ought to be easy to change linda and lintian
to check for invalid UTF-8 in the man pages, which should help with the
cat-herding aspects of the problem.
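
The check such a tool would need is roughly the following. This is a sketch in
Python with a made-up function name and a made-up message format, not what
linda or lintian actually implement:

```python
# Illustrative sketch only; the names and the message text are hypothetical.
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_page(name: str, data: bytes) -> Optional[str]:
    """Return a warning string for an invalid page, or None if it is clean."""
    offset = first_invalid_utf8_offset(data)
    if offset is None:
        return None
    return f"{name}: invalid UTF-8 at byte offset {offset}"
```

Pages living in a directory with an explicit encoding in its name would be
skipped, or decoded with that encoding instead.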
┌── ｄｇ＠ｃｏｗｌａｒｋ．ｃｏｍ ─── http://www.cowlark.com ───────────────────
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
-----END PGP SIGNATURE-----