
UTF-8 manual pages

I have uploaded man-db 2.5.0-1, which includes the following changes:

        o Per-locale directory handling has been improved. Directories such
          as "fr.UTF-8" may be used for occasions when it is appropriate to
          specify the character set but not the country, and so a full
          locale name is inconvenient.

        o There is a new "manconv" program which can try multiple possible
          encodings for a file, thus allowing UTF-8 manual pages to be
          installed in any directory even without an explicit encoding.

I would like to recommend that package maintainers, particularly
maintainers of manpages-* packages, begin to install manual pages
encoded in UTF-8, so that we can shake down the details before
encapsulating this in the policy manual.

My original plan was that UTF-8 manual pages should be installed in
/usr/share/man/<language>.UTF-8/ (unless your language is Chinese or
Portuguese, use just the language code, not the country code; for
example, French manual pages would go in /usr/share/man/fr.UTF-8/). I had
a long discussion with Adam Borowski on debian-mentors recently in which
he persuaded me that it was both possible and worth it to implement
compatibility with the scheme used by e.g. Red Hat, in which manual
pages installed in unadorned directories such as /usr/share/man/fr/ are
assumed to be in UTF-8. To avoid the obvious transitional nightmare, the
"manconv" program mentioned above guesses the file encoding on the fly,
so both UTF-8 and legacy encodings are permitted. For reasons that will
be obvious to those familiar with the details of character encodings, it
is usually only possible to guess between UTF-8 and a single other
encoding this way, but that's good enough for us.
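
As a rough illustration of the guessing described above, here is a
minimal Python sketch. It is not manconv's actual code; the function
name guess_decode and the ISO-8859-1 default are assumptions made for
the example. It relies on the fact that UTF-8 is strict about which
byte sequences are valid, while an 8-bit legacy encoding accepts almost
anything, which is why only one legacy encoding can usefully follow
UTF-8 in the list.

```python
from typing import Tuple


def guess_decode(data: bytes, legacy_encoding: str = "ISO-8859-1") -> Tuple[str, str]:
    """Decode a manual page, trying UTF-8 first and one legacy encoding second.

    Hypothetical helper for illustration only. Returns the decoded text
    together with the name of the encoding that succeeded.
    """
    try:
        # Non-ASCII text in a legacy 8-bit encoding is very unlikely to
        # form valid UTF-8 sequences, so a clean decode is a strong signal.
        return data.decode("utf-8"), "UTF-8"
    except UnicodeDecodeError:
        # An 8-bit encoding such as ISO-8859-1 accepts every byte, so this
        # fallback cannot be told apart from other 8-bit encodings.
        return data.decode(legacy_encoding), legacy_encoding


if __name__ == "__main__":
    for raw in ("caf\u00e9".encode("utf-8"), "caf\u00e9".encode("iso-8859-1")):
        text, used = guess_decode(raw)
        print(used, repr(text))
```

A page that is pure ASCII decodes identically either way, so the guess
only matters for pages that actually contain non-ASCII characters.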

This means that we now have a choice of putting UTF-8 manual pages in
/usr/share/man/<language>/ or /usr/share/man/<language>.UTF-8/. Although
Adam made a valiant effort to persuade me otherwise, I still favour the
.UTF-8 suffix; it's explicit, and it means that if your man program
doesn't support UTF-8 (xman and yelp probably don't, for instance) then
you will get the English manual page rather than a pile of misencoded
garbage. Whether you think this is desirable probably depends on your
language; a misencoded French page would be mostly readable anyway,
while a misencoded Japanese page is entirely unusable.

However, I'm posting to debian-devel and debian-i18n about this to give
people the opportunity to advocate the other position. At this point,
neither choice will present major technical difficulties as far as
man-db is concerned. I would like to ask that people consider the
practicalities of other man implementations as well as pure aesthetics.

groff does not yet support UTF-8 input, so at the moment this is
implemented by recoding in man. For the time being, the implementation
requires that the page be convertible to the legacy encoding for the
language using iconv (it uses //TRANSLIT so that it will make an attempt
at characters that aren't directly convertible, but that isn't perfect);
so a German manual page should avoid using UTF-8 characters without an
equivalent in ISO-8859-1. I do not expect this to be particularly
onerous for the time being, though there are a few cases (particularly
proper names) where it may be a problem. I ask for your patience in
those cases. If you need to use a character not in the corresponding
legacy encoding, then I recommend using named character escapes as
documented in groff_char(7).
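
Maintainers who want to check a UTF-8 page against this restriction
before uploading could use something along the lines of the following
Python sketch. The function name unconvertible_chars and the
ISO-8859-1 default are assumptions made for the example, and the check
is stricter than what man does: it reports every character with no
exact equivalent, whereas iconv with //TRANSLIT may still manage an
approximation for some of them.

```python
from typing import List, Tuple


def unconvertible_chars(text: str, legacy_encoding: str = "ISO-8859-1") -> List[Tuple[int, str]]:
    """List (line number, character) pairs that the legacy encoding lacks.

    Hypothetical helper for illustration only. Line numbers start at 1.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            try:
                ch.encode(legacy_encoding)
            except UnicodeEncodeError:
                # No exact equivalent in the legacy encoding; a candidate
                # for a named character escape from groff_char(7).
                problems.append((lineno, ch))
    return problems


if __name__ == "__main__":
    page = "Gr\u00fc\u00dfe\nDvo\u0159\u00e1k"
    for lineno, ch in unconvertible_chars(page):
        print("line %d: U+%04X has no ISO-8859-1 equivalent" % (lineno, ord(ch)))
```

An empty result means the page should survive the recoding unchanged.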

Once we have a consensus on install locations, dh_installman should IMO
be changed to do the recoding automatically; to do this, it needs to be
told the source encoding. Joey, what do you think is the best way to do
this? Options that come to mind are:

  * --language=<ll>.ISO-8859-1
  * --source-encoding=ISO-8859-1
  * manpage:ISO-8859-1 on the command line or in debian/package.manpages

It's worth noting that packages may well have manual pages in a number
of languages with a variety of encodings, so I'm not sure how well a
global --source-encoding option would work.

Of course the other option would be for dh_installman to DWIM and guess
the encoding in the same way man does. :-) The transition to UTF-8 would
happen much faster if maintainers didn't have to specify the encoding by
hand. If you'd like to take this approach I can add code to man-db as
necessary to help out.


Colin Watson                                       [cjwatson@debian.org]
