
UTF-8 manual pages



Manual pages may now be installed in UTF-8
==========================================

Historically, translated manual pages have been installed using a
variety of character encodings, usually legacy ones (ISO-8859-*, KOI8-R,
EUC-*, and so on). While these encodings are still supported, I now
recommend that Debian developers begin to install all manual pages in
UTF-8.

User locales are unaffected by this change. Provided that all the
characters involved can be handled in the locale in question, manual
pages installed in UTF-8 or indeed in other encodings will display just
as well regardless of the locale.

Pages should continue to be installed in /usr/share/man/LL where LL is
the ISO-639-1 code for the language. Country codes should not be used
unless they make a significant difference to the language (as with
pt_BR, zh_CN, and zh_TW). There is no need to include an encoding in the
directory name.

Dependencies
------------

The necessary support in man-db is now in testing, and consensus on
debian-policy was that no additional dependencies would in general be
needed when converting pages to UTF-8, any more than a package
delivering a file in a new version of HTML would need to conflict with
browsers that do not implement it. As an exception, maintainers of
packages consisting solely of translated manual pages may choose to
conflict with man-db (<< 2.5.1-1).

Migration arrangements
----------------------

For packages using debhelper and dh_installman, a simple rebuild with
debhelper 6.0.5 or newer will install your manual pages in the UTF-8
character encoding automatically [1]. If you do not wish to use
dh_installman, then the debhelper patch [1] may give you an idea of how
to do the same thing by hand.

If dh_installman guesses the source encoding wrongly, see manconv(1) for
an override mechanism.
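
For illustration, the override documented in manconv(1) is an encoding
declaration on the first line of the page source. The following is only
a rough sketch: the helper name declare_encoding is made up for this
example, and it assumes the page does not already start with a '\"
comment line (if it does, for instance '\" t, the declaration belongs on
that same line instead).

```shell
# Hypothetical helper (not part of man-db or debhelper): prepend a
# manconv(1) encoding declaration to a manual page source file.
# Assumption: the page has no existing '\" first line to merge with.
declare_encoding() {
    page=$1
    encoding=$2
    tmp=$(mktemp) || return 1
    {
        # Emits:  '\" -*- coding: ENCODING -*-
        printf '%s\n' "'\\\" -*- coding: $encoding -*-"
        cat "$page"
    } > "$tmp" && cat "$tmp" > "$page"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It would be called as, say, 'declare_encoding foo.fr.1 ISO-8859-1' on a
page whose source encoding is being guessed wrongly.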

Manual pages that are maintained in the Debian diff or in Debian-native
packages may have their source form migrated to UTF-8 at your
convenience, perhaps in consultation with translators. Ordinary files
may be converted using 'iconv -f <original encoding> -t UTF-8'; check
the result afterwards to ensure that you have not produced
double-encoded UTF-8 (i.e. garbage) by mistake. For manual pages
produced using po4a, adding opt:"-L UTF-8" to the [type:man] section in
po4a.cfg, converting any addenda to UTF-8 as above, and regenerating the
output files should be sufficient.
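
As a sketch of the iconv step and the check for double-encoding, the
following may be a useful starting point. Both function names are
invented for this example. The check is only a heuristic: it assumes the
spurious extra layer came from misreading UTF-8 bytes as ISO-8859-1, so
it will miss other mistakes and can in rare cases flag a correct file.

```shell
# Hypothetical helper: convert FILE ($1) from encoding FROM ($2) to
# UTF-8, writing the result to OUT ($3).
convert_page() {
    iconv -f "$2" -t UTF-8 "$1" > "$3"
}

# Hypothetical heuristic: succeed if FILE looks like double-encoded
# UTF-8, assuming UTF-8 bytes were wrongly treated as ISO-8859-1.
looks_double_encoded() {
    # A pure-ASCII file has nothing that could be double-encoded.
    [ $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) -gt 0 ] || return 1
    undone=$(mktemp) || return 2
    # Strip one UTF-8 layer; if valid UTF-8 is still left, be suspicious.
    if iconv -f UTF-8 -t ISO-8859-1 "$1" > "$undone" 2> /dev/null &&
       iconv -f UTF-8 -t UTF-8 "$undone" > /dev/null 2>&1; then
        verdict=0
    else
        verdict=1
    fi
    rm -f "$undone"
    return "$verdict"
}
```

A typical use would be 'convert_page foo.fr.1 ISO-8859-1 foo.fr.1.new'
followed by 'looks_double_encoded foo.fr.1.new && echo "check this"'.
Reading the converted page by eye remains the more reliable test.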

There should generally be no need to ask upstream maintainers to convert
their manual pages to UTF-8. Support for legacy systems may often
require the use of legacy encodings, and the measures above mean that we
can move gradually towards a fully UTF-8 system without needing to
disturb their existing arrangements.

After migration
---------------

If you convert your source to UTF-8, note that current groff limitations
mean that all characters in the source must remain representable in the
usual legacy encoding for that language (so, for example, a French
manual page may not contain Russian characters since those are not
available in ISO-8859-1). This occasionally causes problems,
particularly when writing authors' names; groff_char(7) may help in
that case.
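
One way to check this constraint, sketched here with an invented helper
name, is to ask iconv to convert the UTF-8 source back to the legacy
encoding and see whether it succeeds. This assumes an iconv that exits
with a non-zero status on unconvertible characters, as the glibc one
does.

```shell
# Hypothetical helper: succeed only if every character in the UTF-8
# page FILE ($1) can be represented in the legacy encoding ($2).
# The converted output is discarded; only the exit status matters.
fits_legacy_encoding() {
    iconv -f UTF-8 -t "$2" "$1" > /dev/null 2>&1
}
```

For a French page this might be run as
'fits_legacy_encoding foo.fr.1 ISO-8859-1 || echo "unrepresentable"'.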

You should avoid using special Unicode punctuation characters such as
hyphens, dashes, and so on in manual page source files. See
groff_char(7) for safe equivalents. This work is intended for better
representation of alphabetic characters, not so that we can use more
Unicode gadgets.
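
As an illustration of the kind of substitution meant here, the filter
below rewrites a handful of Unicode punctuation characters to the
corresponding glyph names listed in groff_char(7). The function name is
made up and the mapping is deliberately incomplete; treat it as a sketch
and review the output rather than applying it blindly.

```shell
# Hypothetical filter: replace some common Unicode punctuation with
# groff_char(7) escapes. Reads stdin, writes stdout. Not exhaustive.
ascii_punctuation() {
    sed -e 's/—/\\(em/g' \
        -e 's/–/\\(en/g' \
        -e 's/‐/\\(hy/g' \
        -e 's/“/\\(lq/g' \
        -e 's/”/\\(rq/g' \
        -e 's/‘/\\(oq/g' \
        -e 's/’/\\(cq/g'
}
```

It could be used as 'ascii_punctuation < foo.1 > foo.1.new'.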

Other software
--------------

Graphical manual page viewers may have problems with differing
encodings, although usually not significantly worse than the problems
they already had with encoding soup. I have sent patches for yelp [2]
and konqueror [3]; at least xman and tkman could also do with work from
developers who understand them.

As a general rule, viewers that implement their own manual page
rendering engines should read source files via 'man --recode UTF-8',
instruct the rendering engine to expect UTF-8, and depend on man-db (>=
2.5.1-1). Viewers that use groff, troff, or nroff to format manual pages
should instead use 'man -Tutf8' (which also removes the need to call
tbl, eqn, et al. explicitly), instruct the display code to expect UTF-8,
and depend on man-db.

Policy manual
-------------

A policy amendment [4] is in progress to ratify these arrangements.

Acknowledgements and references
-------------------------------

Thanks to Adam Borowski, Jens Seidel, Russ Allbery, Brian M. Carlson,
Joey Hess, and others for discussion and work leading up to this.

I posted some blog entries [5] [6] while working on this, which may be
interesting for historical context.

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=462937
    "debhelper: recode manual pages to UTF-8"

[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=465229
    "yelp: Recode manual pages to UTF-8"

[3] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=449554
    "konqueror: man pages viewed in konqueror are not in utf-8 (but in
    iso8859 for fr ...)"

[4] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440420
    "[AMENDMENT 11/02/2008] Manual page encoding"

[5] http://www.chiark.greenend.org.uk/ucgi/~cjwatson/blosxom/2007-09-17-man-db-encodings.html
    "Encodings in man-db"

[6] http://www.chiark.greenend.org.uk/ucgi/~cjwatson/blosxom/2008-01-29-utf-8-manual-pages.html
    "UTF-8 manual pages"

Thanks,

-- 
Colin Watson                                       [cjwatson@debian.org]
