Re: Bug#440420: [PROPOSAL] Manual page encoding
On Sun, Dec 30, 2007 at 10:28:12PM -0800, Russ Allbery wrote:
> Colin Watson <firstname.lastname@example.org> writes:
> > I propose that policy should standardise that we move to using UTF-8 as
> > the source encoding for all manual pages since it clearly makes sense to
> > do so. This will still need to be specified by each manual page (by
> > means of the directory in which it is installed), and it does *not*
> > affect what user locales are supported in any way. The
> > internationalisation changes in man-db 2.5.0 will arrange for users to
> > see pages in their native language when they did not before; I do not
> > expect it to cause any users to fail to see pages in their native
> > language when they previously did.
> > Once man-db 2.5.0 is in place, the change in policy to recommend
> > installing pages with UTF-8 encoding in a properly marked directory will
> > have *no* effect on users, no matter what their locale. It is purely for
> > improved maintenance of the system.
> Hi Colin,
> I assume that now that man-db 2.5.0 is in the archive, the original patch
> in this bug report is no longer current and we should now be saying
> something different. We're trying to increase the speed of Policy work,
> so hopefully this time we can get a change made in a timely fashion. :)
Right. Here's an update; I think I've captured most of the discussion in
the thread so far. The following patch could in principle be applied
now, given seconds. Wordsmithing welcome, as I'm aware that this is a
rather dense recommendation; I'm also looking for seconds for this:
@@ -8521,6 +8521,43 @@
be present in the future.
+ Manual pages installed under subdirectories of
+ <file>/usr/share/man</file> with a codeset specification (e.g.
+ <file>/usr/share/man/fr.UTF-8</file> or
+ <file>/usr/share/man/de_DE.ISO-8859-1</file>) must be encoded
+ using the named character encoding. The subdirectory name does
+ not need to be a well-formed locale as in
+ <file>/usr/share/i18n/SUPPORTED</file>; a language and
+ codeset, for example <file>de.UTF-8</file>, is all that is
+ necessary for most languages.<footnote>In fact, specifying a
+ country is often harmful, as it excludes users of the language
+ in other countries; de_DE would apply only to speakers of
+ German in Germany, and not to those in Austria.</footnote>
+ For compatibility with both previous versions of Debian and
+ other systems, manual pages in other locale-specific
+ subdirectories of <file>/usr/share/man</file> should use
+ either UTF-8 or the usual legacy encoding for that language
+ (usually the one corresponding to the shortest relevant locale
+ name in <file>/usr/share/i18n/SUPPORTED</file>). For example,
+ pages under <file>/usr/share/man/fr</file> should use either
+ UTF-8 or ISO-8859-1.<footnote><prgn>man</prgn> will
+ automatically detect whether UTF-8 is in use. In future, all
+ manual pages will be required to use UTF-8.</footnote>
+ Due to limitations in current implementations, all characters
+ in the manual page source should be representable in the usual
+ legacy encoding for that language, even if the file is
+ actually encoded in UTF-8. Safe alternative ways to write many
+ characters outside that range may be found in
+ <manref name="groff_char" section="7">.
Not lying about your encoding is a safe "must", I think, because this is
pretty much indisputable and I know of no cases of this rule being
broken in today's archive (though I haven't done a full scan).
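
A full scan of this kind could be sketched roughly as follows. This is only an illustration in Python: the helper names declared_codeset and honours_declared_codeset are hypothetical and are not part of man-db or debhelper, and the directory-name pattern is an assumption that covers names like fr.UTF-8 and de_DE.ISO-8859-1 only.

```python
import re
from pathlib import PurePosixPath

# Assumed shape of a locale directory name: language, optional country,
# optional codeset (e.g. "fr", "de_DE", "fr.UTF-8", "de_DE.ISO-8859-1").
_LOCALE_DIR = re.compile(
    r"^(?P<lang>[a-z]{2,3}(?:_[A-Z]{2})?)(?:\.(?P<codeset>[A-Za-z0-9_-]+))?$"
)


def declared_codeset(path):
    """Return the codeset named by the locale directory of a page, or None."""
    parts = PurePosixPath(path).parts
    # Look for .../usr/share/man/<locale-dir>/...
    for i in range(len(parts) - 3):
        if parts[i:i + 3] == ("usr", "share", "man"):
            m = _LOCALE_DIR.match(parts[i + 3])
            return m.group("codeset") if m else None
    return None


def honours_declared_codeset(path, data):
    """True if data decodes in the declared codeset, or if none is declared."""
    codeset = declared_codeset(path)
    if codeset is None:
        return True
    try:
        data.decode(codeset)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes, or a codeset name Python does not know.
        return False
    return True
```

Note that a check like this is weak for single-byte codesets such as ISO-8859-1, where every byte sequence decodes successfully; it is mainly useful for catching legacy-encoded pages in a directory marked UTF-8.
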
Once we're a little further into the transition, I would like to replace
the second paragraph above with one that says that all manual pages
"should" be encoded in UTF-8.
I'm still open to whether new-world-order pages should go in
/usr/share/man/LL.UTF-8 or just /usr/share/man/LL. Pros for LL.UTF-8:
* Non-compliant implementations (I'm guessing xman, yelp, etc.) will
display English manual pages rather than misencoded garbage. This
might not be such a big deal for European languages, but for e.g.
Japanese I suspect most people would prefer English to the spew you
get by trying to interpret UTF-8 as EUC-JP.
* Determining progress towards universal UTF-8 encoding can trivially
be done by scanning Contents files rather than having to unpack the
archive and run iconv over everything.
* In the event that we later want to migrate to yet another
"universal" encoding that can't be automatically distinguished from
UTF-8, we already have the encoding name right there and migration
will be straightforward. (I think this is an unlikely scenario.)
Cons for LL.UTF-8:
* Many upstream developers using Debian systems will follow along
without realising that this only works with man-db. The result will
be that e.g. Red Hat users will miss out on localised manual pages
even though (AIUI) their man implementation expects UTF-8 in
* Changing dh_installman to move these files around might break a few
debian/rules files that name subdirectories of /usr/share/man
* As an aesthetic point, the debris of this transition will be visible
I think I am increasingly leaning towards just using /usr/share/man/LL,
seeing as man has to try decoding pages there as UTF-8 first anyway, but
please comment if you care.
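
The Contents-file scan mentioned among the pros for LL.UTF-8 might look something like the sketch below. It assumes each Contents line has the form "path, whitespace, section/package" with no leading slash on the path; tally_codesets is a hypothetical name, and pages in unmarked language directories are counted under the made-up label "unmarked".

```python
import re
from collections import Counter

# Localised manual page paths as assumed to appear in a Contents file.
# The codeset suffix on the language directory is optional.
_LOCALISED = re.compile(
    r"^usr/share/man/(?P<lang>[a-z]{2,3}(?:_[A-Z]{2})?)"
    r"(?:\.(?P<codeset>[^/]+))?/man[^/]*/"
)


def tally_codesets(lines):
    """Count localised manual pages per declared codeset."""
    counts = Counter()
    for line in lines:
        # Split off the trailing package column; the path may contain spaces.
        fields = line.rstrip("\n").rsplit(None, 1)
        if len(fields) != 2:
            continue
        m = _LOCALISED.match(fields[0])
        if m:
            counts[m.group("codeset") or "unmarked"] += 1
    return counts
```

With plain /usr/share/man/LL directories no such tally is possible from path names alone, which is the trade-off described above.
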
> Could you send a new patch to document the current recommendations for how
> to encode man pages and deal with different locales when you get a chance?
Unfortunately 2.5.0 wasn't quite enough. Aside from a couple of stupid
bugs (mostly fixed now), it turns out that we need an extra feature to
allow debhelper to produce UTF-8 versions of manual pages without
needing the source encoding to be explicitly specified, by guessing the
encoding in the same way that man does.
I committed this feature to my development trunk earlier today, and will
be working on a 2.5.1 release over the next couple of weeks. After that
I'll send Joey a patch for debhelper.
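
As a rough illustration of the guessing step (not man-db's actual code), the idea of trying UTF-8 first and falling back to the language's legacy encoding can be sketched in Python. The function name recode_to_utf8 is hypothetical, and the legacy encoding is assumed to be supplied by the caller.

```python
def recode_to_utf8(data, legacy_encoding):
    """Return the page source as UTF-8 bytes, guessing its current encoding.

    Strict UTF-8 decoding is tried first, since non-ASCII text in a legacy
    encoding rarely happens to be valid UTF-8. If that fails, the bytes are
    assumed to be in legacy_encoding.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat as the legacy encoding and convert.
        return data.decode(legacy_encoding).encode("utf-8")
    # Already valid UTF-8 (pure ASCII also ends up here); leave untouched.
    return data
```

For example, recode_to_utf8(page_bytes, "iso-8859-1") would be the call for a page under /usr/share/man/fr, and "euc_jp" the fallback for a Japanese page.
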
Thus, an updated transition plan:
1. Initial status: packages should use only /usr/share/man/<ll>/
(although some packages have anticipated an approximation of the
transition plan; we ignore these for the moment as there is little
point in changing them only to change them back later), and must
use the legacy encoding for pages installed there.
2. man-db 2.5.0-1 uploaded, including support for installing pages in
/usr/share/man/<ll>.<codeset>/ (e.g. /usr/share/man/fr.UTF-8). The
basename of this directory is not typically a well-formed locale,
but it allows a clear specification of the hierarchy's encoding
while applying to all countries using that language. [DONE]
3. man-db 2.5.1-1 uploaded, including 'man --recode'.
4. dh_installman updated to recode manual pages to UTF-8 automatically
(and install them under /usr/share/man/<ll>.UTF-8/?), using 'man
--recode UTF-8' to guess the original encoding. debhelper Depends:
man-db (>= 2.5.1-1) for this. Pages for which the DWIM fails can
include an explicit coding: directive, which will be documented.
5. man-db 2.5.1-1 and the corresponding debhelper move into testing.
6. Packages encouraged (via debian-devel-announce) to begin using
UTF-8 for manual pages (and /usr/share/man/<ll>.UTF-8/?). They do
not need to declare any package relationship on man-db for this.
7. Policy updated to recommend UTF-8 once this has been shaken down,
confirmed to work properly, and deployed through a reasonable chunk
of the archive thanks to debhelper.
8. Distant future: deprecate /usr/share/man/<ll>/. This will only be
for consistency, so there's no need to rush.
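
If step 4 does end up installing recoded pages under /usr/share/man/<ll>.UTF-8/ (still an open question above), the path mapping could be sketched as below. This is a hypothetical Python helper, not dh_installman's behaviour; utf8_destination and the language-directory pattern are assumptions for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumed shape of an unmarked language directory: "fr", "de_DE", etc.
_LANG_DIR = re.compile(r"^[a-z]{2,3}(?:_[A-Z]{2})?$")


def utf8_destination(path):
    """Map .../usr/share/man/<ll>/... to .../usr/share/man/<ll>.UTF-8/...

    Paths that are not under an unmarked language directory (section
    directories such as man1, or directories that already name a codeset)
    are returned unchanged.
    """
    parts = list(PurePosixPath(path).parts)
    for i in range(len(parts) - 3):
        if parts[i:i + 3] == ["usr", "share", "man"] and _LANG_DIR.match(parts[i + 3]):
            parts[i + 3] += ".UTF-8"
            return str(PurePosixPath(*parts))
    return path
```
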
Colin Watson [email@example.com]