Bug#440420: [PROPOSAL] Manual page encoding

To: "Giacomo A. Catenazzi" <cate@debian.org>
Cc: 440420@bugs.debian.org, debian-i18n@lists.debian.org
Subject: Bug#440420: [PROPOSAL] Manual page encoding
From: Colin Watson <cjwatson@debian.org>
Date: Tue, 4 Sep 2007 11:52:57 +0100
Message-id: <20070904105256.GK6091@riva.ucam.org>
Reply-to: Colin Watson <cjwatson@debian.org>, 440420@bugs.debian.org
In-reply-to: <46DD1D67.6020906@debian.org>
References: <20070901120232.GB18492@riva.ucam.org> <46DC2A62.50402@debian.org> <20070903164719.GE6091@riva.ucam.org> <46DD1D67.6020906@debian.org>

On Tue, Sep 04, 2007 at 10:55:03AM +0200, Giacomo A. Catenazzi wrote:
> Colin Watson wrote:
> >On Mon, Sep 03, 2007 at 05:38:10PM +0200, Giacomo A. Catenazzi wrote:
> >>I don't like the proposal ;-)
> >>It is not very POSIXly and too application-specific.
> >
> >Of course it is application-specific; /usr/share/man is
> >application-specific (i.e. specific to the man application). Methods of
> >processing /usr/share/man that don't use /usr/bin/man are already broken
> >in other ways. (man exports a number of specialised interfaces that can
> >be used by frontends, and I'm happy to add more on request.)
> 
> But we have the same problem with info, with the HOWTO, with the
> doc, ....

Manual pages are different because:

  * They are not typically read directly, but via a toolset that is
    capable of dealing with such matters as encoding translation in a
    manner appropriate to the user's locale. In other words, we can
    safely recommend UTF-8 in the comfortable knowledge that it can be
    done transparently.

    info may share this property; I'm not sure because I'm not familiar
    with it at the implementation level and I haven't noticed it having
    much in the way of internationalisation support in general.

    I don't particularly object to recommending UTF-8 for HTML
    documentation as such, but it is clearly less convenient as you need
    to adjust the files themselves to declare a character set rather
    than just installing them in a different place.

    Other documentation is often read with a simple pager. UTF-8 is
    probably the most convenient encoding long-term in order that you
    can read documentation in more than one language without
    reconfiguring your software, but I imagine there is plenty of room
    for local exceptions here and it is certainly less clear.

  * As a general rule, manual pages are much better localised than other
    documentation. That is, they actually get localised. We may not be
    anywhere close to completion, but compare it to the other forms of
    documentation you mentioned: info has a handful of translations with
    a variety of naming conventions (is there any client support for
    selecting them automatically?), and random files in /usr/share/doc
    typically aren't localised or at best maybe have one or two
    translations (usually in the upstream author's native language). The
    only other form of documentation I'm aware of with a comparable
    level of localisation is the HOWTOs from the Linux Documentation
    Project.

  * Because our current groff implementation imposes quite strict
    restrictions on what input and output encodings are possible, and
    usually needs to know detailed information about these encodings in
    order to achieve correct typography, it is if anything more
    important than usual for man to have an accurate idea of the
    document's character set.

  * Because manual page encoding is specified by means of file system
    location, and because only a strict subset of the file system is
    allowed, it is important for policy to specify how this is to be
    handled across many packages for interoperability, more so than for
    forms of documentation where file system location is immaterial.

> For this reason, I would like a general policy and solution.
> (The /usr/share/man policy would then be a follow-up.)
> 
> Or are there fewer problems with other docs?

I don't think it's really reasonable or necessary to create a general
policy covering both /usr/share/man and other documentation in a single
piece of text. The requirements are too different, and several different
documentation formats have their own special requirements and need to
move at their own pace. Current policy wisely does not attempt to treat
them as a single unit, but has subsections for the two major specialised
formats (man and info).

> >POSIX does not specify anything about the layout of /usr/share/man. The
> >FHS makes an attempt, but it's horribly broken (speaking as one who has
> >attempted to implement it), predates widespread deployment of UTF-8, and
> >does not really help with the problem to hand anyway.
> 
> Yes, I saw (and there are some strange considerations), but I meant:
> POSIX defines locales and how applications use locales.
> If we convert manpages to UTF-8, I think we break POSIX:
> the user can see the wrong encoding.

No, you still don't understand. The conversion is only applied to the
source files, not to what users see. POSIX does not impose requirements
on the encoding of applications' data files. Each file clearly has to
have an encoding, and an application that knows what encoding is in use
and can convert it to the user's locale is clearly doing a better job
than one that can't.

> But I was thinking to a possible over-engineering: manpages that
> explain output of the program: the output in an ideal world should
> be written in the user locale (number and dates).

You mean the LC_NUMERIC and LC_TIME locale categories? There is no
support for this in groff and I think this is unlikely to happen. As you
suggest yourself, this is overengineering; a manual page is probably
better advised to explain in prose, as it's not at all impossible for a
user to look at a manual page in a different locale.

In any case, I would appreciate it if you didn't turn this proposal,
which is purely about encodings, into a general debate about wishlists
for locale handling in manual pages.

> So in the policy I would mention the possible triplets
> (for application reading the files),

Triplets? Do you mean language[_territory][.codeset]? Just say "locales"
rather than inventing a new term.

I'm not sure what you want to be mentioned, though. Are you looking for
a complete specification of the possible subdirectory names under
/usr/share/man? Perhaps it would be better to document that in man-db,
and leave policy to recommend the best choice rather than document all
possible choices. After all, the policy group's job is to take
decisions.
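
For reference, the language[_territory][.codeset] form can be picked
apart with a short regular expression. This is only an illustrative
Python sketch: the helper name split_locale is made up, and it is not a
description of how man-db parses directory names.

```python
import re

# Hypothetical helper for illustration: matches names of the form
# language[_territory][.codeset], e.g. "de", "pt_BR", "ja.UTF-8".
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?$"
)


def split_locale(name):
    """Return (language, territory, codeset); absent parts are None."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError("not of the form language[_territory][.codeset]: %r" % name)
    return (
        match.group("language"),
        match.group("territory"),
        match.group("codeset"),
    )
```

A name such as "pt_BR.UTF-8" splits into its language, territory and
codeset, while a bare "de" yields only the language.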

> >>It is confusing the "legacy (non-UTF-8) character".
> >
> >Yes, it is, but it is current practice and I merely document it. If we
> >were starting from scratch with the benefit of hindsight then obviously
> >we wouldn't have done it this way.
> >
> >I think it's unambiguous for all languages where we actually have
> >existing manual pages to worry about.
> 
> I don't like the wording.  Now it seems that UTF-8 is superior
> to other encodings, but we should not take UTF-8 as the ultimate
> encoding.  I propose a simple "non-UTF-8 character".
> Anyway this is a very minor point.

I'm not sure this is the right place to debate UTF-8's superiority to
earlier 8-bit encodings such as ISO-8859-1 or the double-byte character
sets. I think it's self-evident while it's not clear that you do, and
this doesn't seem like the place to reach agreement on that. I also
don't think in this case that we need to be afraid to adopt the best
available encoding now for fear that a better one might come along
later; should that happen, we can simply move along gradually to it and
have man recode on the fly, just as I'm proposing we do here.

Sure, we can say "non-UTF-8" rather than "legacy", though I think policy
should be unafraid to take a strong stance on this. I borrow the
"legacy" term from Unicode advocates such as Markus Kuhn. I think it's
quite an accurate and justified description of the encodings that are
only useful for one or a small number of languages.

> >>> 3. man-db 2.5.0-1 moves into testing.
[...]
> >I should clarify that /usr/share/man/<ll>.UTF-8/ will be used by man for
> >all <ll>* locales, not merely for those where the user requested UTF-8;
> >man will recode to the appropriate character set on the fly.
[...]
> "man will recode to the appropriate character set on the fly.",
> so on point 3, you should mention also a new "man" version.

"3. man-db 2.5.0-1 moves into testing."

  $ ls -l /usr/bin/man
  lrwxrwxrwx 1 root root 17 2007-08-26 23:29 /usr/bin/man -> ../lib/man-db/man
  $ dpkg -S /usr/lib/man-db/man
  man-db: /usr/lib/man-db/man

This is the second time in this thread that you've apparently forgotten
to do basic fact-checking before posting. Could you please adjust your
behaviour here? This is getting a little tedious.

> I like UTF-8, but I don't like that we set UTF-8 as
> the default Debian encoding.
> And in such case, I would set a default policy (not only
> for manpages, for debian/changelog, ...).

Policy is already moving in the direction of a default here. See the
footnote to section C.2.2 (which recommends UTF-8 for changelogs):

  I think it is fairly obvious that we need to eventually transition to
  UTF-8 for our package infrastructure; it is really the only sane
  char-set in an international environment. Now, we can't switch to
  using UTF-8 for package control fields and the like until dpkg has
  better support, but one thing we can start doing today is requesting
  that Debian changelogs are UTF-8 encoded. At some point in time, we
  can start requiring them to do so.

> Anyway, IIRC there was some negative comment about email
> in UTF-8, in the discussion about DPL vote and wrong
> MUA handling of signed UTF-8 vote.

E-mail is a difficult case because some mail user agents are stuck in a
bygone age, but that is not comparable to the case of a tree of files
for use essentially by a single program under our clear control.

I don't wish to be arrogant here, but I have six years of practical
experience implementing this kind of stuff in man-db (obviously with
lots of help from experts in particular languages etc.). I do not want
to deal with speculative worries that aren't even about the same
subsystem. For the purposes of this proposal, please restrict your
concerns to real examples regarding manual pages, not half-remembered
comments about e-mail.

> Do you think it is feasible to convert manpages to UTF-8
> from non-Latin alphabets?
> For this point we should see commentary on i18n list

Yes, I do. The Debian CJK patch to groff already implements CJK
encodings (the only case that presents any kind of problem here, to my
knowledge) by converting them to UCS-2 internally and then back to the
source encoding for output. If there is a problem with the conversion,
which as far as I have heard there is not right now, then we would
already be encountering it.

The only other non-Latin encoding currently supported by man-db in
Debian is KOI8-R. Since it's a simple 8-bit encoding, I doubt there is
any kind of round-trip problem with Unicode, and I have not heard of
one.
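
Anyone who wants to check the round-trip claim for an 8-bit encoding
can push every byte value through Unicode and back. A minimal sketch in
Python, assuming that Python's bundled koi8_r codec matches the KOI8-R
in use here; the function name is invented for illustration:

```python
def roundtrip_failures(codec):
    """Return the byte values that do not survive codec -> Unicode -> codec."""
    failures = []
    for value in range(256):
        original = bytes([value])
        try:
            recoded = original.decode(codec).encode(codec)
        except UnicodeError:
            # The byte has no mapping in this codec at all.
            failures.append(value)
            continue
        if recoded != original:
            failures.append(value)
    return failures
```

Calling roundtrip_failures("koi8_r") lists any byte values that would be
lost or altered; an empty list means the encoding round-trips cleanly.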

Though the CC hasn't been preserved, I CCed debian-i18n on my initial
bug report, so I hope they're aware of this proposal. I have reinstated
the CC here.

> >>So I propose that manpage specify a charset (i.e. not using the default
> >>locale with only the language (and territory)).
> >
> >That is what I'm doing here. The character set named in the directory
> >name specifies the encoding for all manual pages installed under that
> >directory; it does not mandate that only users of that character set may
> >use these manual pages. (I understand your confusion since this is not
> >what is implemented in current man-db, but frankly that implementation
> >doesn't benefit anyone.)
> 
> But you propose only "UTF-8" encoding.

I propose that policy should standardise that we move to using UTF-8 as
the source encoding for all manual pages since it clearly makes sense to
do so. This will still need to be specified by each manual page (by
means of the directory in which it is installed), and it does *not*
affect what user locales are supported in any way. The
internationalisation changes in man-db 2.5.0 will arrange for users to
see pages in their native language when they did not before; I do not
expect it to cause any users to fail to see pages in their native
language when they previously did.

Once man-db 2.5.0 is in place, the change in policy to recommend
installing pages with UTF-8 encoding in a properly marked directory will
have *no* effect on users, no matter what their locale. It is purely for
improved maintenance of the system.
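
Conceptually, the recoding step is small. The Python sketch below
assumes the source encoding has already been determined from the
directory name and simply transcodes the page to a target character
set. The function name and the choice to substitute characters that the
target cannot represent are assumptions made for illustration; this is
not man-db's actual implementation.

```python
def recode_page(raw, source_encoding, target_encoding):
    """Transcode page bytes from the source encoding to the target one.

    Characters that the target cannot represent are replaced with '?'
    rather than raising an error (an assumption made for this sketch).
    """
    text = raw.decode(source_encoding)
    return text.encode(target_encoding, errors="replace")
```

For example, a page stored as UTF-8 could be handed to a KOI8-R user
with recode_page(raw, "utf-8", "koi8_r"), without the installed file
changing at all.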

> Unfortunately Debian is no more the upstream of man-db.

Excuse me! I'm sorry, but on this point you seem to be quite rude. *I*
am the upstream for man-db, and I do so wearing my Debian developer hat
and using my @debian.org address. After Fabrizio's death in 2001, when I
took over as Debian maintainer of man-db, I contacted Graeme Wilford
informing him of my wish to take over as upstream; I received a reply in
mid-April giving me permission. I released man-db 2.3.18 in May 2001,
and since then have made seven further upstream releases, the last one
being in February of this year.

I use the Debian bug tracking system for upstream purposes, typically
take account of Debian release cycles when doing upstream development,
and upload new upstream versions to Debian promptly. The only thing I
don't do is use the native packaging format, which was really never a
particularly good idea for man-db and which I don't find helpful in this
case. If I as a Debian developer am not the upstream maintainer for
man-db, I should very much like to know who is.

Please retract this misstatement. The most cursory examination of
/usr/share/doc/man-db/copyright would have overturned it. What was the
point of saying that, anyway?

> In summary, now I'm ok with your proposal.
> I don't like the "hardcoded" UTF-8, and I'm not sure that
> an automatic conversion is feasible for some non-Latin alphabets.
> But it is the only clean and reasonable solution.

Thanks. I hope that my comments above clear up the remaining confusion.
I would still appreciate concrete information and examples on why you
don't like the idea of manual pages being installed in UTF-8, bearing in
mind that:

  * as a package maintainer or a translator you wouldn't have to
    actually edit pages in that encoding if you didn't want to;

  * it doesn't have to be done urgently or on any kind of flag day;

  * I have addressed the non-Latin concern above;

  * it will not have a negative effect on users of non-UTF-8 locales.

Regards,

-- 
Colin Watson                                       [cjwatson@debian.org]