Bug#440420: [PROPOSAL] Manual page encoding

Package: debian-policy
Severity: wishlist

[CCs: debian-i18n and debian-doc for obvious reasons, and the debhelper
maintainer since there's a dh_installman change mentioned in the
transition plan further down.]

Recently I have encountered some confusion as to the proper encoding of
manual pages (which is entirely understandable given that this subsystem
is lagging somewhat behind the rest of the world in terms of UTF-8
support). As the man-db maintainer, I would like to clarify this in
policy.

Note that, while there are one or two instances of deviation which
prompted this proposal, this documents current practice in that it is
what has been implemented in man-db for some time and it is already
followed by the vast majority of packages. I don't believe that I'm
making large swathes of packages instantly buggy here; if they did not
follow this policy, they would already be buggy in that pages would be
displayed with visible encoding damage. Accordingly, I've tentatively
used a "must" for the encoding rules. I'm prepared to back off to a
"should" if consensus on the list is against me here.

I have used the language "not yet recommended" regarding installation of
UTF-8 manual pages. My intent here was not so much to normatively state
that this is a bug as to discourage it for the time being. As I noted in
a footnote, I do expect this to be supported properly in man-db 2.5.0,
which I've been working on for a while now (and in earnest for about the
last week).

I thus propose the following amendment, generated against
I am seeking comments on and seconds for this proposal.

--- orig/policy.sgml
+++ mod/policy.sgml
@@ -8450,6 +8450,39 @@
 	      be present in the future.
+	<p>
+	  Manual pages that are installed under
+	  <file>/usr/share/man/</file><var>ll</var>, where <var>ll</var>
+	  is an ISO-639 language code, must be encoded with the usual
+	  legacy (non-UTF-8) character set for that language, as shown
+	  by:
+	  <example compact="compact">
+egrep -v '\.|@|UTF-8' /usr/share/i18n/SUPPORTED
+	  </example>
+	  <footnote>
+	    This is necessary because many packages have historically
+	    included manual pages encoded thus, and changing the
+	    encoding of the whole hierarchy would involve a difficult
+	    transitional period.
+	  </footnote>
+	  Manual pages that are installed under
+	  <file>/usr/share/man/</file><var>locale</var>, where
+	  <var>locale</var> is a full locale name listed in
+	  <file>/usr/share/i18n/SUPPORTED</file>, must be encoded with
+	  the character set implied by that locale.
+	</p>
+	<p>
+	  At present, it is not generally possible to install a manual
+	  page encoded in UTF-8 such that it will be used in all locales
+	  for that language (for example, a page installed under
+	  <file>/usr/share/man/fr_FR.UTF-8</file> will not be used in
+	  the <tt>fr_BE.UTF-8</tt> locale). It is therefore not yet
+	  recommended to install pages encoded in UTF-8, but rather to
+	  continue using the legacy encoding.<footnote>This is expected
+	  to change as of man-db 2.5.0.</footnote>
+	</p>
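
To illustrate what the egrep filter in the proposed text selects, here
is a minimal Python sketch that applies the same pattern to a few sample
lines. The sample lines are an assumption about the shape of
/usr/share/i18n/SUPPORTED (locale name, whitespace, character set) and
are not the full file; legacy_charsets is a hypothetical helper name.

```python
import re

# Same pattern as the egrep in the proposal: a literal dot, "@", or "UTF-8".
EXCLUDE = re.compile(r"\.|@|UTF-8")


def legacy_charsets(supported_text):
    """Map bare locale names to their character set, mimicking `egrep -v`."""
    result = {}
    for line in supported_text.splitlines():
        if EXCLUDE.search(line):
            continue  # dropped, as `egrep -v` would drop it
        parts = line.split()
        if len(parts) != 2:
            continue  # extra safety beyond egrep: skip blank or odd lines
        locale, charset = parts
        result[locale] = charset
    return result


# Assumed sample in the "locale charset" layout; not the real file.
SAMPLE = """\
fr_FR.UTF-8 UTF-8
fr_FR ISO-8859-1
fr_FR@euro ISO-8859-15
ja_JP.EUC-JP EUC-JP
pl_PL ISO-8859-2
"""

for locale, charset in sorted(legacy_charsets(SAMPLE).items()):
    print(locale, charset)
```

Only the entries without a codeset suffix or modifier survive, which is
the "usual legacy character set" the proposed wording refers to. On a
real system the contents of /usr/share/i18n/SUPPORTED could be passed in
place of SAMPLE.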

It will perhaps be helpful if I describe my transition plan for getting
manual pages into UTF-8. Contrary to what occasionally seems to be
popular belief, a newer version of groff is not necessary here (which is
just as well, as repeated attempts to merge in the CJK patch have been
exceedingly painful, though I still hold out hope to get it done
eventually). man-db is capable of shoving in iconv pipes as necessary.
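
As a rough illustration of what such a recoding step does, here is a
minimal Python sketch. It is not man-db's implementation; recode_page
and its parameters are hypothetical names, and the drop_unencodable
option is only an assumption modelled on the behaviour of "iconv -c",
which omits characters that cannot be represented in the target
character set.

```python
def recode_page(data, source="UTF-8", target="ISO-8859-1",
                drop_unencodable=False):
    """Recode manual page bytes from one character set to another.

    With drop_unencodable=True, characters that do not exist in the
    target character set are silently discarded instead of raising.
    """
    text = data.decode(source)
    errors = "ignore" if drop_unencodable else "strict"
    return text.encode(target, errors=errors)


# "cafe" with an acute accent exists in ISO-8859-1, so it recodes cleanly.
clean = recode_page("caf\u00e9".encode("UTF-8"))

# The euro sign does not exist in ISO-8859-1; in lossy mode it is dropped.
lossy = recode_page("5 \u20ac".encode("UTF-8"), drop_unencodable=True)
```

In strict mode the second call would raise UnicodeEncodeError instead,
which is the trade-off behind the note on discarded characters later in
this message.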

  1. Status at time of writing: packages should use only
     /usr/share/man/<ll>/ (although some packages have anticipated an
     approximation of the transition plan; we ignore these for the
     moment as there is little point in changing them only to change
     them back later), and must use the legacy encoding for pages
     installed there.

  2. man-db 2.5.0-1 uploaded, including support for installing pages in
     /usr/share/man/<ll>.<codeset>/ (e.g. /usr/share/man/fr.UTF-8). The
     basename of this directory is not typically a well-formed locale,
     but it is appropriate because it allows a clear specification of
     the hierarchy's encoding while applying to all countries using that
     language.

  3. man-db 2.5.0-1 moves into testing.

  4. Packages encouraged (via debian-devel-announce) to begin using
     /usr/share/man/<ll>.UTF-8/; installation in other hierarchies will
     not be necessary as man-db will recode as needed. Packages using
     these hierarchies will be encouraged to declare Conflicts: man-db
     (<< 2.5.0-1) (or will Breaks: be allowed by that point? is either
     one just overkill?).

  5. Update dh_installman to recode manual pages to UTF-8 automatically
     and install them under /usr/share/man/<ll>.UTF-8/. Getting the
     Conflicts:/Breaks: in here might be difficult, plus I'm not sure
     I'm wild about creating several thousand more arcs in our
     dependency graph. Maybe it's better just to wait for a stable
     release before changing debhelper, and not worry too much about the
     Conflicts:/Breaks: as it's not like the whole system will break as
     a result.

  6. Policy updated once this has been shaken down and confirmed to work.

  7. Distant future: deprecate /usr/share/man/<ll>/. This will only be
     for consistency, so there's no need to rush.
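
The directory naming in this plan can be sketched as follows. This is
only an illustration under stated assumptions: page_encoding and
utf8_hierarchy are hypothetical helpers, the LEGACY table is a small
hand-written subset rather than anything read from the system, and
locale modifiers such as @euro are deliberately not handled.

```python
# Illustrative subset of language -> legacy character set (assumption).
LEGACY = {"fr": "ISO-8859-1", "de": "ISO-8859-1", "pl": "ISO-8859-2"}


def page_encoding(man_subdir):
    """Return the encoding implied by a directory name under /usr/share/man.

    "<ll>.<codeset>" names the codeset explicitly; a bare "<ll>" falls
    back to the legacy character set, or None if the language is unknown
    to this sketch.
    """
    name, sep, codeset = man_subdir.partition(".")
    if sep:
        return codeset
    language = name.split("_")[0]
    return LEGACY.get(language)


def utf8_hierarchy(language):
    """Return the UTF-8 hierarchy for a language, as in steps 2 and 4."""
    return "/usr/share/man/{}.UTF-8".format(language)


print(page_encoding("fr"))
print(page_encoding("fr.UTF-8"))
print(utf8_hierarchy("fr"))
```

The point of the <ll>.<codeset> form is visible here: the encoding is
read straight from the directory name, with no country component
involved.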

This shouldn't be too difficult from where I am now, and at the moment I
see no obstacles to landing UTF-8 manual page support for lenny. Note
that the implementation using iconv will mean that any characters used
that are not recodable to the corresponding legacy encoding will be
discarded; this is difficult to avoid without upgrading groff, but I
don't anticipate it being a substantial problem. Likewise, we'll
probably still be unable to handle Arabic and Indic scripts properly,
and CJK will probably still be a massive hack; but it'll be an
improvement.


Colin Watson                                       [cjwatson@debian.org]
