Re: manpage character cleanup for UTF-8 compatibility
On Tue, Mar 25, 2003 at 04:01:51PM -0800, Vineet Kumar wrote:
> Using a UTF-8 locale, I've been finding many manpages using incorrect
> characters. Groff converts many of these characters to reasonable
> characters in ASCII locales, but some things break in UTF-8 locales.
One other thought has occurred to me while working on fixing certain
parts of man-db's locale support. Sooner or later, when groff 2 is
released (but not beforehand!), we're going to have to move towards
having all man pages encoded in UTF-8. For most languages this probably
isn't too bad: you just use de_DE.UTF-8 rather than de, or whatever
(although I'm not sure how that'd work for languages with multiple
regional variants). It's going to be a royal pain for English, though,
because currently we just put things directly in /usr/share/man, meaning
the C locale, and there's no C.UTF-8, probably for good reasons.
en_US.UTF-8 would be a poor choice because we also need en_GB.UTF-8 and
so on.
So, could I request that all English man pages follow this advice from
groff_char(7)?
    All roff systems provide the concept of named characters. In
    traditional roff systems, only names of length 2 were used,
    while groff also provides support for longer names. It is
    strongly suggested that only named characters are used for all
    characters outside of the 7-bit ASCII range.
If you refrain from using any characters outside 7-bit ASCII, then the
problem of encoding doesn't arise, and the transition to groff 2 some
time in the future will be a great deal easier: we can eventually just
start assuming that English man pages are in UTF-8, and it won't make
any difference to existing pages because ASCII is a subset of UTF-8. For
instance, please write "na\(:ive" or "na\[:i]ve" rather than "naïve".
groff_char(7) has a list of the named characters you can use in this
way.
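
To illustrate the substitution, here is a minimal Python sketch that
rewrites a few Latin-1 characters as groff named characters. The
function name to_named_chars and the table GROFF_NAMES are hypothetical,
and the table is deliberately tiny; groff_char(7) remains the
authoritative list of names.

```python
# Hypothetical, deliberately incomplete table mapping a few Latin-1
# characters to groff named-character escapes. Keys are written as
# \u escapes so this source file is itself pure 7-bit ASCII.
GROFF_NAMES = {
    "\u00ef": r"\[:i]",  # i with diaeresis
    "\u00e9": r"\['e]",  # e with acute accent
    "\u00e8": r"\[`e]",  # e with grave accent
    "\u00fc": r"\[:u]",  # u with diaeresis
    "\u00f1": r"\[~n]",  # n with tilde
    "\u00e7": r"\[,c]",  # c with cedilla
    "\u00df": r"\[ss]",  # German sharp s
    "\u00a9": r"\[co]",  # copyright sign
}


def to_named_chars(text):
    """Replace known non-ASCII characters with groff named characters.

    ASCII characters pass through unchanged. A non-ASCII character that
    is missing from the table raises ValueError, so nothing is dropped
    or mangled silently.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in GROFF_NAMES:
            out.append(GROFF_NAMES[ch])
        else:
            raise ValueError("no groff name known for U+%04X" % ord(ch))
    return "".join(out)
```

With this table, to_named_chars("na\u00efve") gives the second of the
two spellings suggested above.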
I think it would be a good idea for someone with time to audit
/usr/share/man for this, or to write a lintian patch. It will be
interesting to see whether tools like pod2man cause problems here.
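
As a rough idea of what such an audit might look like, the Python sketch
below reports which lines of each page contain bytes outside 7-bit
ASCII. The names non_ascii_lines and audit_tree are made up for this
example, and it assumes pages are either plain files or gzip-compressed
files with a .gz suffix. It is not a lintian check, only a starting
point.

```python
import gzip
import os


def non_ascii_lines(data):
    """Return the 1-based numbers of lines containing any byte >= 0x80."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if any(byte >= 0x80 for byte in line)
    ]


def audit_tree(root):
    """Yield (path, line_numbers) for each page under root that
    contains at least one byte outside 7-bit ASCII."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip dangling symlinks and anything that is not a file.
            if not os.path.isfile(path):
                continue
            # Assumption: compressed pages carry a .gz suffix.
            opener = gzip.open if name.endswith(".gz") else open
            with opener(path, "rb") as handle:
                data = handle.read()
            lines = non_ascii_lines(data)
            if lines:
                yield path, lines
```

Iterating over audit_tree("/usr/share/man/man1") would then list the
offending pages in that section. Pointing it at /usr/share/man itself
also walks the translated pages, which are expected to contain 8-bit
characters, so an English-only audit would want to restrict itself to
the top-level man* directories.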
(Languages not encoded in ISO-8859-1 are being handled by somewhat
broken brute force at the moment anyway, so for now they should probably
ignore this advice. groff 1.18.2 will improve the situation for
ISO-8859-2 and provide a basis on which to build a more reliable 8-bit
clean device to serve as a stopgap until UTF-8 input is available.)
--
Colin Watson [cjwatson@flatline.org.uk]