Bug#440420: [PROPOSAL] Manual page encoding

Colin Watson wrote:
Package: debian-policy
Severity: wishlist


--- orig/policy.sgml
+++ mod/policy.sgml
@@ -8450,6 +8450,39 @@
 	      be present in the future.
+	<p>
+	  Manual pages that are installed under
+	  <file>/usr/share/man/</file><var>ll</var>, where <var>ll</var>
+	  is an ISO-639 language code, must be encoded with the usual
+	  legacy (non-UTF-8) character set for that language, as shown
+	  by:
+	  <example compact="compact">
+egrep -v '\.|@|UTF-8' /usr/share/i18n/SUPPORTED
+	  </example>
+	  <footnote>
+	    This is necessary because many packages have historically
+	    included manual pages encoded thus, and changing the
+	    encoding of the whole hierarchy would involve a difficult
+	    transitional period.
+	  </footnote>
+	  Manual pages that are installed under
+	  <file>/usr/share/man/</file><var>locale</var>, where
+	  <var>locale</var> is a full locale name listed in
+	  <file>/usr/share/i18n/SUPPORTED</file>, must be encoded with
+	  the character set implied by that locale.
+	</p>
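
A minimal sketch, in Python, of what the egrep filter quoted above
selects: it drops every line containing a dot, an "@" or the string
"UTF-8", leaving the bare ll_CC entries with their legacy charsets.
The SAMPLE lines are illustrative assumptions about the file format,
not a copy of the real /usr/share/i18n/SUPPORTED.

```python
import re

# Same pattern as the egrep call: a literal dot, an "@", or "UTF-8".
EXCLUDE = re.compile(r"\.|@|UTF-8")


def legacy_entries(lines):
    """Return the lines that `egrep -v` would keep."""
    return [line for line in lines if not EXCLUDE.search(line)]


# Hypothetical sample in "locale charset" form (assumed format).
SAMPLE = [
    "fr_FR.UTF-8 UTF-8",
    "fr_FR ISO-8859-1",
    "fr_FR@euro ISO-8859-15",
    "ja_JP.EUC-JP EUC-JP",
    "ru_RU ISO-8859-5",
]

if __name__ == "__main__":
    for entry in legacy_entries(SAMPLE):
        print(entry)
```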

I don't like the proposal ;-)
It is not very POSIX-like and too application-specific.

The POSIX way to specify a locale is:
language[_territory][.codeset]
(or language[_territory][.codeset][@modifier] for some LC_ variables)
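
A minimal sketch, in Python, of splitting a name of that form into
its parts. The regular expression and the function name parse_locale
are assumptions made for illustration; the character classes are a
guess at what commonly appears in locale names, not the POSIX
grammar itself.

```python
import re

# language[_territory][.codeset][@modifier]
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9_-]+))?$"
)


def parse_locale(name):
    """Split a locale name into a dict; absent parts are None."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognisable locale name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    for name in ("fr", "fr_BE.UTF-8", "fr_FR@euro"):
        print(name, parse_locale(name))
```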

The phrase "legacy (non-UTF-8) character set" is confusing.
Every locale has a charset, so the man page should be
encoded according to the locale named in its manual path.

I've some problem with
Who generates this file?
IIRC our glibc has more locales.
I can't find "en" or "de" in it.

Given the above, I think that "en" (for example) has
a charset (from glibc), so its man pages should be encoded in
that charset.
Any other charset in a man page is a bug.

+	<p>
+	  At present, it is not generally possible to install a manual
+	  page encoded in UTF-8 such that it will be used in all locales
+	  for that language (for example, a page installed under
+	  <file>/usr/share/man/fr_FR.UTF-8</file> will not be used in
+	  the <tt>fr_BE.UTF-8</tt> locale). It is therefore not yet
+	  recommended to install pages encoded in UTF-8, but rather to
+	  continue using the legacy encoding.<footnote>This is expected
+	  to change as of man-db 2.5.0.</footnote>
+	</p>

If I understand correctly, this is only a transitional comment, so
I think we should leave it out, and update the policy once
man-db/man is fixed.

It will perhaps be helpful if I describe my transition plan for getting
manual pages into UTF-8. Contrary to what occasionally seems to be
popular belief, a newer version of groff is not necessary here (which is
just as well as repeated attempts to merge in the CJK patch have been
exceedingly painful, though I still hold out hope to get it done
eventually). man-db is capable of shoving in iconv pipes as necessary.
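
A minimal sketch, in Python, of the kind of recoding such an iconv
pipe performs: decode the page from its source charset and re-encode
it in the target one. The function name recode_page is hypothetical,
and this is not how man-db itself is implemented.

```python
def recode_page(data: bytes, source_encoding: str,
                target_encoding: str = "utf-8") -> bytes:
    """Recode raw manual page bytes from one charset to another.

    Raises UnicodeDecodeError if the input is not valid in
    source_encoding, and UnicodeEncodeError if a character has no
    representation in target_encoding.
    """
    return data.decode(source_encoding).encode(target_encoding)


if __name__ == "__main__":
    legacy = b".SH R\xc9SUM\xc9\n"  # "RÉSUMÉ" in ISO-8859-1
    print(recode_page(legacy, "iso-8859-1"))
```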

  1. Status at time of writing: packages should use only
     /usr/share/man/<ll>/ (although some packages have anticipated an
     approximation of the transition plan; we ignore these for the
     moment as there is little point in changing them only to change
     them back later), and must use the legacy encoding for pages
     installed there.

As above, I don't think it is incorrect.
But I agree that it will cause difficulties if the default encoding
ever changes, or when trying to tell which encoding a given language uses.

  2. man-db 2.5.0-1 uploaded, including support for installing pages in
     /usr/share/man/<ll>.<codeset>/ (e.g. /usr/share/man/fr.UTF-8). The
     basename of this directory is not typically a well-formed locale,
     but it is appropriate because it allows a clear specification of
     the hierarchy's encoding while applying to all countries using that

Use locales and locale priorities as specified by POSIX, and
allow a full <locale>, not only a subset.
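
A minimal sketch, in Python, of one possible search order that goes
from the full locale down to the bare language, so that a user in
fr_BE.UTF-8 would still reach pages installed under fr.UTF-8 or fr.
The ordering and the function name candidate_dirs are assumptions
made for illustration; this is neither man-db's actual algorithm nor
something POSIX prescribes. Any @modifier is ignored here.

```python
def candidate_dirs(locale_name, root="/usr/share/man"):
    """List manual hierarchies to try, most specific first."""
    base, _, _modifier = locale_name.partition("@")  # modifier ignored
    base, _, codeset = base.partition(".")
    language, _, territory = base.partition("_")

    names = []
    if territory and codeset:
        names.append(f"{language}_{territory}.{codeset}")
    if territory:
        names.append(f"{language}_{territory}")
    if codeset:
        names.append(f"{language}.{codeset}")
    names.append(language)
    return [f"{root}/{name}" for name in names]


if __name__ == "__main__":
    for path in candidate_dirs("fr_BE.UTF-8"):
        print(path)
```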

  3. man-db 2.5.0-1 moves into testing.

  4. Packages encouraged (via debian-devel-announce) to begin using
     /usr/share/man/<ll>.UTF-8/; installation in other hierarchies will
     not be necessary as man-db will recode as needed. Packages using
     these hierarchies will be encouraged to declare Conflicts: man-db
     (<< 2.5.0-1) (or will Breaks: be allowed by that point? is either
     one just overkill?).

I don't think we should move to UTF-8; rather, we should allow
any charset that suits the language.  It is also very difficult
to get upstreams to change charset.

So I propose that each man page hierarchy specify a charset (i.e. not
rely on the default locale given only the language (and territory)).

  5. Update dh_installman to recode manual pages to UTF-8 automatically
     and install them under /usr/share/man/<ll>.UTF-8/. Getting the
     Conflicts:/Breaks: in here might be difficult, plus I'm not sure
     I'm wild about creating several thousand more arcs in our
     dependency graph. Maybe it's better just to wait for a stable
     release before changing debhelper, and not worry too much about the
     Conflicts:/Breaks: as it's not like the whole system will break as
     a result.

Change: recode to the relevant charset instead.
BTW I think this should be done dynamically by the "man" program.

BTW there should be only one "original" man page per language, and
the other encodings should be generated from it (except in very
special cases). Otherwise it will be difficult to maintain the
versions in parallel.

  6. Policy updated once this has been shaken down and confirmed to work

So without the transition comment.

  7. Distant future: deprecate /usr/share/man/<ll>/. This will only be
     for consistency, so there's no need to rush.

No, but in the near future it should be a symbolic link to the
right (as defined by the locale) ll.charset directory.
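
A minimal sketch, in Python, of such a link: the bare ll directory
becomes a relative symlink to ll.charset. The function name
link_language_dir and the charset passed in are assumptions for
illustration; which charset is "right" for a language is exactly the
open question, so the caller has to supply it.

```python
import os
import tempfile


def link_language_dir(man_root, language, codeset):
    """Make man_root/<language> a symlink to <language>.<codeset>."""
    target = f"{language}.{codeset}"
    os.makedirs(os.path.join(man_root, target), exist_ok=True)
    link = os.path.join(man_root, language)
    # Leave any existing directory or link alone.
    if not os.path.lexists(link):
        os.symlink(target, link)  # relative link within man_root
    return link


if __name__ == "__main__":
    # Work in a scratch directory rather than /usr/share/man.
    with tempfile.TemporaryDirectory() as root:
        link = link_language_dir(root, "fr", "ISO-8859-1")
        print(link, "->", os.readlink(link))
```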

Eventually we should discuss locale definitions with the glibc
people, and how to export that information to other programs
(and thus to "man").

