Bug#440420: [PROPOSAL] Manual page encoding

To: Colin Watson <cjwatson@debian.org>, 440420@bugs.debian.org
Subject: Bug#440420: [PROPOSAL] Manual page encoding
From: "Giacomo A. Catenazzi" <cate@debian.org>
Date: Tue, 04 Sep 2007 10:55:03 +0200
Message-id: <46DD1D67.6020906@debian.org>
Reply-to: "Giacomo A. Catenazzi" <cate@debian.org>, 440420@bugs.debian.org
In-reply-to: <20070903164719.GE6091@riva.ucam.org>
References: <20070901120232.GB18492@riva.ucam.org> <46DC2A62.50402@debian.org> <20070903164719.GE6091@riva.ucam.org>

Colin Watson wrote:

On Mon, Sep 03, 2007 at 05:38:10PM +0200, Giacomo A. Catenazzi wrote:

Colin Watson wrote:

I don't like the proposal ;-)
It is not very POSIX-like and too application-specific.


Of course it is application-specific; /usr/share/man is
application-specific (i.e. specific to the man application). Methods of
processing /usr/share/man that don't use /usr/bin/man are already broken
in other ways. (man exports a number of specialised interfaces that can
be used by frontends, and I'm happy to add more on request.)


But we have the same problem with info, with the HOWTOs, with the
docs, ...
[For most such programs it is not a huge problem: the build
scripts could correct it (e.g. from 8-bit to TeX or HTML "symbol" codes).]

For this reason, I would like a general policy and solution.
(The /usr/share/man policy would then be a follow-up.)

Or are there fewer problems with other docs?

POSIX does not specify anything about the layout of /usr/share/man. The
FHS makes an attempt, but it's horribly broken (speaking as one who has
attempted to implement it), predates widespread deployment of UTF-8, and
does not really help with the problem to hand anyway.


Yes, I saw (and there are some strange considerations), but I meant:
POSIX defines locales and how applications use locales.
If we convert manpages to UTF-8, I think we break POSIX:
the user could see the wrong encoding.
But OK, this should be only an implementation detail, as long as
it does not break valid non-UTF-8 locales.

1-
The POSIX way to specify a locale is:
language[_territory][.codeset]
(or [language[_territory][.codeset][@modifier]] for some LC_ variables)


Note that e.g. fr.UTF-8 matches this pattern, so I don't see your
problem. The territory is intentionally omitted from the installation
directory in my transition plan because it causes real problems.
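
To make the pattern concrete, here is a minimal Python sketch that splits a name of the form language[_territory][.codeset][@modifier] into its parts. The regular expression and the names LocaleName and parse_locale_name are illustrative assumptions, not anything taken from POSIX, glibc or man-db; the check is purely syntactic and says nothing about whether glibc knows the locale.

```python
import re
from typing import NamedTuple, Optional

# Syntactic shape only: language[_territory][.codeset][@modifier].
# The allowed characters are an assumption made for this sketch.
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


def parse_locale_name(name: str) -> Optional[LocaleName]:
    """Return the parts of a locale-style name, or None if it does not fit."""
    match = _LOCALE_RE.match(name)
    if match is None:
        return None
    return LocaleName(**match.groupdict())


if __name__ == "__main__":
    for candidate in ("fr.UTF-8", "fr_FR.ISO-8859-1", "pt_BR", "de_DE@euro"):
        print(candidate, parse_locale_name(candidate))
```

With this sketch, "fr.UTF-8" parses with the territory simply absent, which is the sense in which it matches the pattern.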


Yes, it was only a comment. I misread the rest of the thread
(a patch correcting markup and not an alternate proposal).

man will support full locale names under /usr/share/man, but in my
transition plan I do not recommend using them because you don't
typically want to make your French manual pages available only to users
in France; they should be available to Belgians, French Canadians, Swiss
French, and Luxembourgers as well. The standard exceptions well-known to
internationalisation implementors are Chinese (zh_CN and zh_TW are
different dialects and different scripts) and Portuguese (pt_PT and
pt_BR are more or less different languages).
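
The rule described above (drop the territory, except for Chinese and Portuguese) could be sketched roughly as follows. This is only an illustration of the recommendation, assuming a hypothetical helper man_language_dir; it is not how man-db or any packaging tool is implemented.

```python
# Languages for which, per the message above, the territory is significant.
TERRITORY_SENSITIVE = {"zh", "pt"}


def man_language_dir(locale_name: str) -> str:
    """Pick the language part of a manual page directory for a locale name.

    The codeset and modifier are ignored here; the territory is kept only
    for the languages listed in TERRITORY_SENSITIVE.
    """
    # Strip "@modifier" and ".codeset", leaving language[_territory].
    base = locale_name.split("@", 1)[0].split(".", 1)[0]
    language, _, territory = base.partition("_")
    if territory and language in TERRITORY_SENSITIVE:
        return f"{language}_{territory}"
    return language


if __name__ == "__main__":
    for name in ("fr_FR.UTF-8", "fr_CH.UTF-8", "fr_CA", "pt_BR.UTF-8", "zh_TW"):
        print(name, "->", man_language_dir(name))
```

Under this sketch fr_FR, fr_CH, fr_CA and fr_BE all share one "fr" hierarchy, while pt_PT and pt_BR stay separate.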


Yes, I was sure there were exceptions (with locales there are special
cases everywhere).
But I was thinking of a possible over-engineering: manpages that
explain the output of the program. In an ideal world that output should
be written in the user's locale (numbers and dates).
I think this should not be done now for fr_FR, fr_CH, fr_CA, fr_BE,
but maybe in the future (and in an automatic way).
So in the policy I would mention the possible triplets
(for applications reading the files), but OTOH, man pages
should not yet be installed with a territory (and possibly the policy
could list the zh_ and pt_ exceptions [and whatever we forgot]).

The wording "legacy (non-UTF-8) character" is confusing.


Yes, it is, but it is current practice and I merely document it. If we
were starting from scratch with the benefit of hindsight then obviously
we wouldn't have done it this way.

I think it's unambiguous for all languages where we actually have
existing manual pages to worry about.


I don't like the wording.  Now it seems that UTF-8 is superior
to other encodings, but we should take UTF-8 as the ultimate
encoding.  I propose a simple "non-UTF-8 character".
Anyway this is a very minor point.

Every locale has a charset. So the man page should be
encoded according to the right locale (in the manual PATH).


My proposal (the diff, as opposed to the transition plan later in my
original message) documents current practice, in which manual pages are
installed in directories such as /usr/share/man/fr. "fr" is not a full
locale name recognised by glibc, and does not have a defined character
set in our system. Thus, we must define its character set by means of
observing that historically pages installed there have been encoded in
ISO-8859-1, and standardising that to prevent unsolvable encoding
conflicts.

In future, it absolutely makes sense to install the pages in
/usr/share/man/fr.UTF-8 instead, which is where my transition plan takes
us. But, for now, the only available alternatives are
/usr/share/man/fr_FR.ISO-8859-1 and /usr/share/man/fr_FR.UTF-8, which
(as above) have fundamental problems, and in any case are not
well-supported at the moment (in man-db 2.4.*,
/usr/share/man/fr_FR.UTF-8 will only be used if you are using that exact
locale; in man-db 2.5.0, it will be used for users of the fr_FR
(ISO-8859-1) locale as well and recoded on the fly, so that you don't
have to install one manual page per possible encoding).


OK. I made a wrong assumption ("fr" is not a legal locale).
See my other comment on point 4 of the transition plan.

2-
I've some problem with
/usr/share/i18n/SUPPORTED

(...)

I don't find "en", "de".


That's because glibc does not recognise those as valid locales. If you
believe that a locale exists in our system but it is not in
/usr/share/i18n/SUPPORTED, you are by definition mistaken. :-)


Yes. I was confusing it with HTTP language syndication (I don't
remember the exact word). In POSIX I found nothing about
priorities.

3-
Given the above point, I think that "en" (as an example) has
a charset (from glibc), so man pages should be encoded in
that charset.


Your assumption is mistaken, I'm afraid. /usr/share/i18n/SUPPORTED is
the canonical list of available locales in our system. There is no
straightforward way to ask the question "what is the conventional legacy
character set for <language>?" without also specifying a country, which
doesn't help when trying to determine the character set of files under
/usr/share/man/fr. That's why man has its own table for this.
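
As a rough illustration of what such a per-language table could look like, here is a small Python sketch. Only the "fr" entry comes from this thread; the other entries and the names LEGACY_CHARSETS and legacy_charset are assumptions chosen for illustration and are not copied from man-db.

```python
from typing import Optional

# Conventional legacy (non-UTF-8) charset per language directory.
LEGACY_CHARSETS = {
    "fr": "ISO-8859-1",  # stated in the message above
    "de": "ISO-8859-1",  # assumption, for illustration only
    "pl": "ISO-8859-2",  # assumption, for illustration only
    "ru": "KOI8-R",      # assumption, for illustration only
    "ja": "EUC-JP",      # assumption, for illustration only
}


def legacy_charset(language: str) -> Optional[str]:
    """Look up the conventional legacy charset for a bare language code.

    Returns None when the table has no entry; what a real tool should do
    in that case is left open in this sketch.
    """
    return LEGACY_CHARSETS.get(language)


if __name__ == "__main__":
    print(legacy_charset("fr"))
    print(legacy_charset("xx"))
```

The point of the table is that the lookup needs only the language, with no country involved, which is exactly the question glibc cannot answer for a directory such as /usr/share/man/fr.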


yes, wrong assumptions.

 2. man-db 2.5.0-1 uploaded, including support for installing pages
 in /usr/share/man/<ll>.<codeset>/ (e.g. /usr/share/man/fr.UTF-8).
 The basename of this directory is not typically a well-formed
 locale, but it is appropriate because it allows a clear
 specification of the hierarchy's encoding while applying to all
 countries using that language.

Use locales and locale priorities as specified in POSIX, and allow the
full <locale>, not only a subset.


man-db permits them and will continue to do so, but as above I strongly
believe that with the exception of Chinese and Portuguese it is not
generally to our users' advantage to install manual pages under full
locale names, unless you're lucky enough to use a language spoken in
only one country. (IIRC you're in Switzerland; do you use it_CH.UTF-8?
If so, you would not be well-served by pages specifying it_IT.UTF-8, in
the same way that you would not be well-served by .po files specifying
"it_IT" rather than just "it".)


My language and locale is "C", although I created the it_CH locale for glibc ;-)
As explained above, I would add a note so that programs can expect a
territory, but for now we should not install man pages in a triplet
(except possibly for pt_ and zh_).

 3. man-db 2.5.0-1 moves into testing.

 4. Packages encouraged (via debian-devel-announce) to begin using
 /usr/share/man/<ll>.UTF-8/; installation in other hierarchies will
 not be necessary as man-db will recode as needed. Packages using
 these hierarchies will be encouraged to declare Conflicts: man-db
 (<< 2.5.0-1) (or will Breaks: be allowed by that point? is either
 one just overkill?).

I don't think we should go to UTF-8, but we should allow users to use
any charset that is good for the language.  It is also very difficult to
change charsets or upstreams.


I should clarify that /usr/share/man/<ll>.UTF-8/ will be used by man for
all <ll>* locales, not merely for those where the user requested UTF-8;
man will recode to the appropriate character set on the fly.

It is true that manual pages could be installed using any character set
and would work fine, but since we will be able to standardise on UTF-8 I
think we should do so, for all the same reasons that we should
standardise on UTF-8 elsewhere: for one, it greatly simplifies things if
you're looking at manual page source for whatever reason.

Upstreams do not need to change, or at least can change at their
leisure; it's trivial to recode the page to UTF-8 in debian/rules.
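
Both directions of recoding mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming hypothetical helpers recode_to_utf8 (build time, e.g. called from a packaging script) and recode_for_display (view time); it is not how man-db performs its recoding.

```python
def recode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Build-time step: turn a page in a legacy encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def recode_for_display(utf8_page: bytes, target_encoding: str) -> bytes:
    """View-time step: recode a UTF-8 page to the user's charset.

    Characters that the target charset cannot represent are replaced with
    "?" in this sketch; a real viewer may choose a different fallback.
    """
    return utf8_page.decode("utf-8").encode(target_encoding, errors="replace")


if __name__ == "__main__":
    legacy = "Résumé des options".encode("iso-8859-1")
    page = recode_to_utf8(legacy, "iso-8859-1")
    # Round trip back to the legacy charset for an ISO-8859-1 user.
    print(recode_for_display(page, "iso-8859-1") == legacy)
```

Recoding from a legacy charset to UTF-8 loses nothing; the lossy case is the other direction, when a UTF-8 page contains characters the user's charset lacks.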


"man will recode to the appropriate character set on the fly.",
so in point 3, you should also mention a new "man" version.

I like UTF-8, but I don't like that we set UTF-8 as the
predefined Debian encoding.
And in that case, I would set a default policy (not only
for manpages, but also for debian/changelog, ...).

Anyway, IIRC there were some negative comments about email
in UTF-8, in the discussion about the DPL vote and wrong
MUA handling of signed UTF-8 votes.

Do you think it is feasible to convert manpages to UTF-8
from non-Latin alphabets?
On this point we should see comments on the i18n list.

So I propose that manpages specify a charset (i.e. not using the default
locale with only the language (and territory)).


That is what I'm doing here. The character set named in the directory
name specifies the encoding for all manual pages installed under that
directory; it does not mandate that only users of that character set may
use these manual pages. (I understand your confusion since this is not
what is implemented in current man-db, but frankly that implementation
doesn't benefit anyone.)


But you propose only the "UTF-8" encoding.
Unfortunately Debian is no longer the upstream of man-db.

There are other ways of specifying the encoding such as by putting them
in a header in the page itself, but those are much less convenient in
practice and are less efficient when implemented (since you have to
decompress and open the page before you can find its encoding).
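
The efficiency argument can be illustrated with a short Python sketch: when the codeset is part of the directory name, it can be read from the path alone, without decompressing or opening the page. The helper codeset_from_page_path and the default root are assumptions for illustration, not man-db code.

```python
from pathlib import PurePosixPath
from typing import Optional


def codeset_from_page_path(
    page_path: str, man_root: str = "/usr/share/man"
) -> Optional[str]:
    """Return the codeset named in the language directory, if any.

    Only the path is inspected; the page itself is never opened.
    Raises ValueError if page_path is not below man_root.
    """
    relative = PurePosixPath(page_path).relative_to(man_root)
    top_level = relative.parts[0]  # e.g. "fr.UTF-8", "fr" or "man1"
    _, separator, codeset = top_level.partition(".")
    return codeset if separator else None


if __name__ == "__main__":
    print(codeset_from_page_path("/usr/share/man/fr.UTF-8/man1/ls.1.gz"))
    print(codeset_from_page_path("/usr/share/man/fr/man1/ls.1.gz"))
```

For a directory without a codeset, such as /usr/share/man/fr, the sketch returns None and the encoding would have to come from some other convention.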


No, I agree that directory based selection of encoding is better.

BTW there should be only one "original" man page per language, and
the other encodings should be generated from this page (except in very
special cases). Otherwise it would be difficult to maintain the
versions in parallel.


There should be only one manual page per language, full stop. In the new
world order, it should be installed under /usr/share/man/<ll>.UTF-8 and
all other encodings will be generated on the fly.

ok

 7. Distant future: deprecate /usr/share/man/<ll>/. This will only
 be for consistency, so there's no need to rush.

No, but in the near future it should be a symbolic link to the right
(as defined in the locale) ll.charset


No, this cannot be done safely (it will create incompatibility) and is
furthermore unnecessary and confusing. In any case it is not possible
for a symbolic link on the filesystem to be dependent on the user's
locale. This is handled in other ways.


No, I meant "fr" pointing to "fr_ISO-8859-1".  But I made the wrong
assumption. So forget my comment.

Eventually we should discuss with glibc people about locale
definition, and how to export information to other programs (and thus
"man")


I've implemented all this personally; glibc already provides all the
information I need, aside from the strange question of "conventional
legacy encodings" which is an extremely ambiguous and debatable request
to make of glibc in any case and which is already handled in a good
enough way in man. There is no need for glibc to change here.


This was also based on the wrong assumption. I could not find an
option in locale(1) or in other files telling me the
default encoding of "fr".  But considering it is not a valid
locale, there is no problem here.


In summary, I'm now OK with your proposal.
I don't like the "hardcoded" UTF-8, and I'm not sure that
automatic conversion is feasible for some non-Latin alphabets.
But it is the only clean and reasonable solution.

ciao
	cate
