
Re: manpage character cleanup for UTF-8 compatibility



On Tue, Mar 25, 2003 at 04:01:51PM -0800, Vineet Kumar wrote:
> Using a UTF-8 locale, I've been finding many manpages using incorrect
> characters.  Groff converts many of these characters to reasonable
> characters in ASCII locales, but some things break in UTF-8 locales.

One other thought has occurred to me while working on fixing certain
parts of man-db's locale support. Sooner or later, when groff 2 is
released (but not beforehand!), we're going to have to move towards
having all man pages encoded in UTF-8. For most languages this probably
isn't too bad: you just use de_DE.UTF-8 rather than de, or whatever
(although I'm not sure how that'd work for languages with multiple
regional variants). It's going to be a royal pain for English, though,
because currently we just put things directly in /usr/share/man, meaning
the C locale, and there's no C.UTF-8, probably for good reasons.
en_US.UTF-8 would be a poor choice because we also need en_GB.UTF-8 and
so on.

So, could I request that all English man pages follow this advice from
groff_char(7)?

       All  roff  systems provide the concept of named characters.  In
       traditional roff systems, only names  of  length 2  were  used,
       while  groff  also  provides  support  for longer names.  It is
       strongly suggested that only named characters are used for  all
       characters outside of the 7-bit ASCII range.

If you refrain from using any characters outside 7-bit ASCII, then the
problem of encoding doesn't arise: a pure-ASCII file is byte-for-byte
the same in ISO-8859-1 and in UTF-8. That will make the transition to
groff 2 some time in the future a great deal easier, because we can
eventually just start assuming that English man pages are in UTF-8
without it making any difference to existing pages. For
instance, please write "na\(:ive" or "na\[:i]ve" rather than "naïve".
groff_char(7) has a list of the named characters you can use in this
way.
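
As a rough illustration of that substitution, here is a minimal Python
sketch. The helper name to_named and the NAMED table are hypothetical,
and the table is only a small subset of the names listed in
groff_char(7); anything it doesn't know about is left alone for review
by hand.

```python
# Hypothetical helper, not part of groff or man-db.
# A small subset of the two-character names from groff_char(7); the keys
# are written as escapes so that this file itself stays 7-bit ASCII.
NAMED = {
    "\u00ef": r"\(:i",  # i with diaeresis
    "\u00e4": r"\(:a",  # a with diaeresis
    "\u00f6": r"\(:o",  # o with diaeresis
    "\u00fc": r"\(:u",  # u with diaeresis
    "\u00e9": r"\('e",  # e with acute accent
    "\u00e8": r"\(`e",  # e with grave accent
    "\u00f1": r"\(~n",  # n with tilde
    "\u00e7": r"\(,c",  # c with cedilla
    "\u00df": r"\(ss",  # German sharp s
    "\u00a9": r"\(co",  # copyright sign
}


def to_named(text: str) -> str:
    """Replace known non-ASCII characters with groff named characters.

    Characters missing from NAMED are passed through unchanged, so the
    result may still need checking for leftover non-ASCII characters.
    """
    return "".join(NAMED.get(ch, ch) for ch in text)
```

With this table, to_named("na\u00efve") gives the first spelling
suggested above, and plain ASCII text passes through untouched.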

I think it would be a good idea for someone with time to audit
/usr/share/man for this, or to write a lintian patch. It will be
interesting to see whether generators such as pod2man cause problems.
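
A first pass at such an audit might look something like the Python
sketch below. It is not a lintian check. The function names are made
up for illustration, and it assumes that English pages live directly in
man* directories under the root and that compressed pages end in .gz.
It works on raw bytes, so it makes no assumption about which 8-bit
encoding a page uses.

```python
import gzip
from pathlib import Path


def non_ascii_lines(data: bytes):
    """Yield (line_number, line) for each line holding a byte >= 0x80."""
    for number, line in enumerate(data.splitlines(), start=1):
        if any(byte >= 0x80 for byte in line):
            yield number, line


def read_page(path: Path) -> bytes:
    """Return the raw bytes of a man page, gunzipping it if needed."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return handle.read()


def audit(root: str) -> None:
    """Report non-ASCII lines in pages directly under root/man*/.

    Translated pages (root/<language>/man*/) are deliberately skipped,
    since the advice here applies to English pages only.
    """
    for path in sorted(Path(root).glob("man*/*")):
        if not path.is_file():
            continue
        for number, line in non_ascii_lines(read_page(path)):
            print(f"{path}:{number}: {line!r}")
```

Calling audit("/usr/share/man") prints one line per offending source
line; pages that already use named characters produce no output.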

(Languages not encoded in ISO-8859-1 are being handled by somewhat
broken brute force at the moment anyway, so for now they should probably
ignore this advice. groff 1.18.2 will improve the situation for
ISO-8859-2 and provide a basis on which to build a more reliable 8-bit
clean device to serve as a stopgap until UTF-8 input is available.)

-- 
Colin Watson                                  [cjwatson@flatline.org.uk]


