
Bug#440420: [PROPOSAL] Manual page encoding



On Mon, Sep 03, 2007 at 05:38:10PM +0200, Giacomo A. Catenazzi wrote:
> Colin Watson wrote:
> >--- orig/policy.sgml
> >+++ mod/policy.sgml
> >@@ -8450,6 +8450,39 @@
> > 	      be present in the future.
> >  	  </footnote>
> >  	</p>
> >+
> >+	<p>
> >+	  Manual pages that are installed under
> >+	  <file>/usr/share/man/</file><var>ll</var>, where <var>ll</var>
> >+	  is an ISO-639 language code, must be encoded with the usual
> >+	  legacy (non-UTF-8) character set for that language, as shown
> >+	  by:
> >+	  <example compact="compact">
> >+egrep -v '\.|@|UTF-8' /usr/share/i18n/SUPPORTED
> >+	  </example>
> >+	  <footnote>
> >+	    This is necessary because many packages have historically
> >+	    included manual pages encoded thus, and changing the
> >+	    encoding of the whole hierarchy would involve a difficult
> >+	    transitional period.
> >+	  </footnote>
> >+	  Manual pages that are installed under
> >+	  <file>/usr/share/man/</file><var>locale</var>, where
> >+	  <var>locale</var> is a full locale name listed in
> >+	  <file>/usr/share/i18n/SUPPORTED</file>, must be encoded with
> >+	  the character set implied by that locale.
> >+	</p>
> 
> I don't like the proposal ;-)
> It is not very POSIXly and too application-specific.

Of course it is application-specific; /usr/share/man is
application-specific (i.e. specific to the man application). Methods of
processing /usr/share/man that don't use /usr/bin/man are already broken
in other ways. (man exports a number of specialised interfaces that can
be used by frontends, and I'm happy to add more on request.)

POSIX does not specify anything about the layout of /usr/share/man. The
FHS makes an attempt, but it's horribly broken (speaking as one who has
attempted to implement it), predates widespread deployment of UTF-8, and
does not really help with the problem to hand anyway.

> 1-
> The POSIX way to specify locale is:
> language[_territory][.codeset] or
> [language[_territory][.codeset][@modifier]] for some LC_ variables)

Note that e.g. fr.UTF-8 matches this pattern, so I don't see your
problem. The territory is intentionally omitted from the installation
directory in my transition plan because it causes real problems.
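
To make the pattern concrete, here is a minimal sketch that splits a name
of the form language[_territory][.codeset][@modifier] into its parts. The
function name split_locale is invented for illustration; this is not how
man or glibc parse locale names.

```shell
#!/bin/sh
# Split a POSIX-style locale name: language[_territory][.codeset][@modifier].
# Prints "language territory codeset modifier", using "-" for absent parts.
# split_locale is a hypothetical helper, not part of man-db or glibc.
split_locale() {
    rest=$1
    modifier=-
    codeset=-
    territory=-
    # Peel the optional parts off from the right-hand end.
    case $rest in *@*) modifier=${rest#*@};  rest=${rest%%@*} ;; esac
    case $rest in *.*) codeset=${rest#*.};   rest=${rest%%.*} ;; esac
    case $rest in *_*) territory=${rest#*_}; rest=${rest%%_*} ;; esac
    printf '%s %s %s %s\n' "$rest" "$territory" "$codeset" "$modifier"
}

split_locale fr.UTF-8
split_locale fr_FR.ISO-8859-1
split_locale fr_FR@euro
```

A name such as fr.UTF-8 parses cleanly with an empty territory, which is
all the installation directory needs.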

man will support full locale names under /usr/share/man, but in my
transition plan I do not recommend using them because you don't
typically want to make your French manual pages available only to users
in France; they should be available to Belgians, French Canadians, Swiss
French, and Luxembourgers as well. The standard exceptions well-known to
internationalisation implementors are Chinese (zh_CN and zh_TW are
different dialects and different scripts) and Portuguese (pt_PT and
pt_BR are more or less different languages).

> It is confusing the "legacy (non-UTF-8) character".

Yes, it is, but it is current practice and I merely document it. If we
were starting from scratch with the benefit of hindsight then obviously
we wouldn't have done it this way.

I think it's unambiguous for all languages where we actually have
existing manual pages to worry about.

> Every locale has a charset. So the man page should be
> encoded according to the right locale (in the manual PATH).

My proposal (the diff, as opposed to the transition plan later in my
original message) documents current practice, in which manual pages are
installed in directories such as /usr/share/man/fr. "fr" is not a full
locale name recognised by glibc, and does not have a defined character
set in our system. Thus, we must define its character set by means of
observing that historically pages installed there have been encoded in
ISO-8859-1, and standardising that to prevent unsolvable encoding
conflicts.

In future, it absolutely makes sense to install the pages in
/usr/share/man/fr.UTF-8 instead, which is where my transition plan takes
us. For now, though, the only available alternatives are
/usr/share/man/fr_FR.ISO-8859-1 and /usr/share/man/fr_FR.UTF-8, which
(as above) have fundamental problems and in any case are not
well-supported at the moment. In man-db 2.4.*,
/usr/share/man/fr_FR.UTF-8 is only used if you are using that exact
locale; in man-db 2.5.0, it will also be used for users of the fr_FR
(ISO-8859-1) locale and recoded on the fly, so that you don't have to
install one manual page per possible encoding.

> 2-
> I've some problem with
> /usr/share/i18n/SUPPORTED
> Who generates this file?
> IIRC our glibc has more locales.

glibc ships this file.

  $ dpkg -S /usr/share/i18n/SUPPORTED
  locales: /usr/share/i18n/SUPPORTED
  $ apt-cache show locales | grep Source:
  Source: glibc

> I don't find "en", "de".

That's because glibc does not recognise those as valid locales. If you
believe that a locale exists in our system but it is not in
/usr/share/i18n/SUPPORTED, you are by definition mistaken. :-)
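
A quick way to check this for yourself is sketched below. It assumes the
file has one "<locale> <charset>" pair per line, as in the Debian locales
package; the helper name is_supported_locale is made up for the example.

```shell
#!/bin/sh
# Report whether a name appears as a locale in a SUPPORTED-style file.
# Assumes each line is "<locale> <charset>", e.g. "fr_FR ISO-8859-1".
# is_supported_locale is a hypothetical helper written for this example.
is_supported_locale() {
    name=$1
    file=${2:-/usr/share/i18n/SUPPORTED}
    # Compare the first field exactly, so that "fr" does not match "fr_FR".
    awk -v want="$name" '$1 == want { found = 1 } END { exit !found }' "$file"
}

# Only run the demonstration where the file actually exists.
if [ -r /usr/share/i18n/SUPPORTED ]; then
    for name in fr fr_FR fr_FR.UTF-8; do
        if is_supported_locale "$name"; then
            echo "$name: listed"
        else
            echo "$name: not listed"
        fi
    done
fi
```

The exact match on the first field is deliberate: a prefix match would
wrongly report "fr" as a locale merely because fr_FR is one.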

> 3-
> With the above point, I think that "en" (as example) has
> a charset (from glibc), so man page should be set with
> such charset.

Your assumption is mistaken, I'm afraid. /usr/share/i18n/SUPPORTED is
the canonical list of available locales in our system. There is no
straightforward way to ask the question "what is the conventional legacy
character set for <language>?" without also specifying a country, which
doesn't help when trying to determine the character set of files under
/usr/share/man/fr. That's why man has its own table for this.
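
As a rough idea of what such a table looks like, here is a sketch in
shell. It is not man's real table: only the fr entry comes from this
thread, and the other entries and the name legacy_charset are
illustrative assumptions.

```shell
#!/bin/sh
# Map a bare language directory name to an assumed legacy character set.
# Only the fr -> ISO-8859-1 entry is taken from this thread; the remaining
# entries are illustrative guesses, and legacy_charset is a made-up name.
legacy_charset() {
    case $1 in
        fr|de|es|it) echo ISO-8859-1 ;;
        pl|cs|hu)    echo ISO-8859-2 ;;
        ru)          echo KOI8-R ;;
        ja)          echo EUC-JP ;;
        *)           return 1 ;;    # unknown language: the caller must decide
    esac
}

legacy_charset fr
```

The point is that the lookup is keyed on the language alone, with no
country involved, which is exactly what glibc cannot answer.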

> >+
> >+	<p>
> >+	  At present, it is not generally possible to install a manual
> >+	  page encoded in UTF-8 such that it will be used in all locales
> >+	  for that language (for example, a page installed under
> >+	  <file>/usr/share/man/fr_FR.UTF-8</file> will not be used in
> >+	  the <tt>fr_BE.UTF-8</tt> locale). It is therefore not yet
> >+	  recommended to install pages encoded in UTF-8, but rather to
> >+	  continue using the legacy encoding.<footnote>This is expected
> >+	  to change as of man-db 2.5.0.</footnote>
> >+	</p>
> >       </sect>
> > 
> >       <sect>
> 
> If I understand correctly, this is only a transitional comment, so
> I think we should forget about this, and update the policy when
> the man-db/man is corrected.

I'm happy to go that route too; I simply thought that, if a policy
upload was coming soon, it might be helpful to document current
practice. It also gives me something to document the new policy against
after man-db 2.5.0. :-)

> >  2. man-db 2.5.0-1 uploaded, including support for installing pages
> >  in /usr/share/man/<ll>.<codeset>/ (e.g. /usr/share/man/fr.UTF-8).
> >  The basename of this directory is not typically a well-formed
> >  locale, but it is appropriate because it allows a clear
> >  specification of the hierarchy's encoding while applying to all
> >  countries using that language.
> 
> Use locale and locale priorities as specified on POSIX, and allow full
> <locale> not only a subclass.

man-db permits them and will continue to do so, but as above I strongly
believe that with the exception of Chinese and Portuguese it is not
generally to our users' advantage to install manual pages under full
locale names, unless you're lucky enough to use a language spoken in
only one country. (IIRC you're in Switzerland; do you use it_CH.UTF-8?
If so, you would not be well-served by pages specifying it_IT.UTF-8, in
the same way that you would not be well-served by .po files specifying
"it_IT" rather than just "it".)

> >  3. man-db 2.5.0-1 moves into testing.
> >
> >  4. Packages encouraged (via debian-devel-announce) to begin using
> >  /usr/share/man/<ll>.UTF-8/; installation in other hierarchies will
> >  not be necessary as man-db will recode as needed. Packages using
> >  these hierarchies will be encouraged to declare Conflicts: man-db
> >  (<< 2.5.0-1) (or will Breaks: be allowed by that point? is either
> >  one just overkill?).
> 
> I don't think we should go to UTF-8, but we should allow users to use
> any good (for the language) charset.  It is also a lot difficult to
> change charset or upstreams.

I should clarify that /usr/share/man/<ll>.UTF-8/ will be used by man for
all <ll>* locales, not merely for those where the user requested UTF-8;
man will recode to the appropriate character set on the fly.
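
The idea can be sketched as follows. This is only an illustration of the
behaviour described above, not man-db's code: the two function names are
invented, and the sketch ignores the Chinese and Portuguese exceptions
mentioned earlier.

```shell
#!/bin/sh
# Name the shared UTF-8 hierarchy for the language of a user's locale.
# utf8_hierarchy and recode_for_user are hypothetical helpers.
utf8_hierarchy() {
    # Keep only the language: drop everything from the first _ . or @.
    lang=${1%%[_.@]*}
    echo "/usr/share/man/${lang}.UTF-8"
}

# Recode a UTF-8 page arriving on stdin to the user's character set.
recode_for_user() {
    iconv -f UTF-8 -t "$1"
}

# Users of different French locales all resolve to the same hierarchy.
for loc in fr_FR fr_BE.UTF-8 fr_CA.ISO-8859-1; do
    echo "$loc -> $(utf8_hierarchy "$loc")"
done
```

In real use the target character set would come from the user's locale,
for example from the output of "locale charmap".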

It is true that manual pages could be installed using any character set
and would work fine, but since we will be able to standardise on UTF-8 I
think we should do so, for all the same reasons that we should
standardise on UTF-8 elsewhere: for one, it greatly simplifies things if
you're looking at manual page source for whatever reason.

Upstreams do not need to change, or at least can change at their
leisure; it's trivial to recode the page to UTF-8 in debian/rules.
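
For instance, the shell commands run from a debian/rules target might
look something like the sketch below. The function name, file names and
the default source charset are assumptions for the example; a real
package would use its own paths.

```shell
#!/bin/sh
# Recode one manual page from a legacy charset to UTF-8.
# Usage: recode_to_utf8 SOURCE DEST [FROM_CHARSET]
# recode_to_utf8 is a hypothetical helper; ISO-8859-1 is an assumed default.
recode_to_utf8() {
    src=$1
    dest=$2
    from=${3:-ISO-8859-1}
    mkdir -p "$(dirname "$dest")"
    iconv -f "$from" -t UTF-8 "$src" > "$dest"
}

# Demonstration in a scratch directory; every path here is made up.
work=$(mktemp -d)
printf 'r\351sum\351\n' > "$work/foo.1"    # "résumé" encoded in ISO-8859-1
recode_to_utf8 "$work/foo.1" "$work/usr/share/man/fr.UTF-8/man1/foo.1"
rm -rf "$work"
```

The upstream source stays in its legacy encoding; only the installed copy
is converted.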

> So I propose that manpage specify a charset (i.e. not using the default
> locale with only the language (and territory)).

That is what I'm doing here. The character set named in the directory
name specifies the encoding for all manual pages installed under that
directory; it does not mandate that only users of that character set may
use these manual pages. (I understand your confusion since this is not
what is implemented in current man-db, but frankly that implementation
doesn't benefit anyone.)

There are other ways of specifying the encoding, such as putting it in a
header in the page itself, but those are much less convenient in
practice and are less efficient when implemented (since you have to
decompress and open the page before you can find its encoding).
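
To illustrate why the directory name is cheaper, the sketch below reads
the encoding straight from a path without touching the file. The function
name encoding_from_path is invented, and the example path is
hypothetical.

```shell
#!/bin/sh
# Derive a page's encoding from its hierarchy name, without opening it.
# Prints the codeset, or fails if the directory name carries none.
# encoding_from_path is a hypothetical helper written for this example.
encoding_from_path() {
    rel=${1#/usr/share/man/}     # e.g. fr.UTF-8/man1/foo.1.gz
    dir=${rel%%/*}               # e.g. fr.UTF-8
    case $dir in
        *.*) cs=${dir#*.}; echo "${cs%%@*}" ;;
        *)   return 1 ;;         # bare <ll>: fall back to a legacy table
    esac
}

encoding_from_path /usr/share/man/fr.UTF-8/man1/foo.1.gz
```

No decompression is involved; the answer is available from the directory
listing alone.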

> >  5. Update dh_installman to recode manual pages to UTF-8
> >  automatically and install them under /usr/share/man/<ll>.UTF-8/.
> >  Getting the Conflicts:/Breaks: in here might be difficult, plus I'm
> >  not sure I'm wild about creating several thousand more arcs in our
> >  dependency graph. Maybe it's better just to wait for a stable
> >  release before changing debhelper, and not worry too much about the
> >  Conflicts:/Breaks: as it's not like the whole system will break as
> >  a result.
> 
> change: to encode in the relevant charset. BTW I think it should be
> done dynamically in the "man" program.

As above, you appear to have misunderstood the transition plan; man will
recode dynamically.

> BTW there should be only one "original" man page per language, and
> this page should create the other encodings (but for very special
> cases). Otherwise it should be difficult to maintain in parallel the
> versions.

There should be only one manual page per language, full stop. In the new
world order, it should be installed under /usr/share/man/<ll>.UTF-8 and
all other encodings will be generated on the fly.

> >  7. Distant future: deprecate /usr/share/man/<ll>/. This will only
> >  be for consistency, so there's no need to rush.
> 
> No, but in a short future: it should be a symbolic link to the right
> (as defined in locale) ll.charset

No, this cannot be done safely (it will create incompatibility) and is
furthermore unnecessary and confusing. In any case it is not possible
for a symbolic link on the filesystem to be dependent on the user's
locale. This is handled in other ways.

> Eventually we should discuss with glibc people about locale
> definition, and how to export information to other programs (and thus
> "man")

I've implemented all this personally; glibc already provides all the
information I need. The one exception is the strange question of
"conventional legacy encodings", which is an extremely ambiguous and
debatable request to make of glibc in any case, and which man already
handles well enough. There is no need for glibc to change here.

Cheers,

-- 
Colin Watson                                       [cjwatson@debian.org]


