
Re: Man pages and UTF-8



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Adam Borowski wrote:
[...]
> Due to Red Hat and probably other dists using UTF-8 already, plenty of man
> pages are in UTF-8 when our groff still can't parse them.  Having gone
> through 2/3 of the archive, I got 807 such pages so far.  And every single
> one displays lovely "ä" or similar instead.  That's 9% of all mans with
> non-ASCII characters in the corpus.

By that you mean they're encoded as UTF-8 where man-db expects whatever
encoding its hard-coded table lists for them, correct? How are you detecting
them?

[...]
>> UTF-8 is supported on output, so it is really transparent for users.
> 
> If you consider having all unsupported characters silently dropped as being
> transparent.  

This may not be as bad as all that, actually.

Currently man-db will cope fine with UTF-8 man pages (if it's expecting them)
and will output UTF-8. Of course, it'll lose all characters not in ISO-8859-1,
but that's a man-db bug.

So, assuming the pages man-db currently treats as ISO-8859-1 really *are* in
ISO-8859-1, we should be able to transcode them all to UTF-8, update man-db's
table so it expects UTF-8, and not lose any functionality. Without having to
wait for the technology, we can do this:

 - transcode all man pages currently in ISO-8859-1 into UTF-8
 - move all non-ISO-8859-1 man pages into directories with explicit encodings

et voila (which we will soon be able to spell reliably as voilà), we have now
achieved total UTF-8 dominance. Admittedly, because we're not handling
non-ISO-8859-1 characters, it's mere buzzword compliance, but that is now a
perfectly manageable bug in man-db and groff. It means that by making one
small change to man-db we can start the policy change and the technology
change *in parallel*, which ought to save loads of time.
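
For what it's worth, the transcoding step in the list above could be sketched
roughly like this in Python. It is only a sketch: the function names are mine,
it assumes the page has already been decompressed, and it treats anything that
already decodes cleanly as UTF-8 as done. ISO-8859-1 text with accented
characters very rarely happens to be valid UTF-8, but that is a heuristic and
not a guarantee.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (pure ASCII included)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode a page assumed to be ISO-8859-1 into UTF-8.

    Pages that are pure ASCII or already valid UTF-8 are returned unchanged,
    so running this twice over the same file does not double-encode it.
    """
    if looks_like_utf8(data):
        return data
    # Every byte value is a valid ISO-8859-1 character, so this decode
    # cannot fail; whether the result is *meaningful* depends on the page
    # really having been ISO-8859-1 in the first place.
    return data.decode("iso-8859-1").encode("utf-8")
```

The "already UTF-8" check is what keeps a second pass from turning "ä" into
"Ã¤".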

...also, because man pages are now either in UTF-8 or in a directory with an
explicit encoding in the name, it ought to be easy to change linda and lintian
to check for invalid UTF-8 in the man pages, which should help with the
cat-herding aspects of the problem.
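
The check itself would be small. Here is a rough Python sketch of the idea,
not how linda or lintian actually implement anything. It assumes (my
assumption) that an explicit encoding shows up as a suffix after a dot in the
language directory name, e.g. "pl.ISO8859-2", and that the page contents have
already been decompressed. All the names are made up for illustration.

```python
from typing import Optional


def declared_encoding(lang_dir: str) -> Optional[str]:
    """Return the encoding named in a directory like 'pl.ISO8859-2', else None."""
    _, dot, encoding = lang_dir.partition(".")
    return encoding if dot else None


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_page(lang_dir: str, data: bytes) -> Optional[str]:
    """Return a warning string for a page that should be UTF-8 but is not."""
    if declared_encoding(lang_dir) is not None:
        # The directory names its own encoding, so UTF-8 is not required.
        return None
    offset = first_invalid_utf8_offset(data)
    if offset is None:
        return None
    return f"invalid UTF-8 at byte {offset} in a page from '{lang_dir}'"
```

A page in a plain "de" directory containing a lone 0xE4 byte would get a
warning; the same bytes under "de.ISO8859-1" would not.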

- --
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGwkG7f9E0noFvlzgRAuefAKDaMn2noIGKL88qav+aaIb+4tEPGwCgi4kk
9wqG7+J19tOflGdaQIs/LqI=
=ZivR
-----END PGP SIGNATURE-----


