Re: Man pages and UTF-8



On Wed, Aug 15, 2007 at 12:50:53AM +0200, Adam Borowski wrote:
> (Colin, CC-ing you as I'm not sure if you're aware of this long thread,
> and both man-db and groff are your territory...)

I wasn't aware of it, thanks. Sorry for my delay in responding.

I read through the thread and there are a number of things that I might
have responded to individually had I been following it at the time, but
now it's probably less confusing if I just reply to this.

> On Tue, Aug 14, 2007 at 05:25:27PM +0200, Nicolas François wrote: 
> > I proposed Colin to work on it during Debconf, but still had no time to do
> > it.
> 
> Could you tell us if anything was born?

I think the best summary of the current status and planning is this
policy proposal, on which I'd very much appreciate further comments,
since people involved in this thread seem to have a good grip on the
issues:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440420

David Given is pretty much spot-on in:

  http://lists.debian.org/debian-mentors/2007/08/msg00308.html

I had hoped to talk about this at Debconf, but unfortunately the
software just wasn't ready yet.

> > I tested a CVS snapshot of groff
> 
> On the other hand, I investigated what the headgear guys produced.  I just
> compiled the package on Debian instead of using a real Red Hat system, so
> due to misconfiguration things may be better than I'm reporting here.

I do need to find the stomach to look at upgrading groff again, but it's
not *necessary* (or indeed sufficient) for this. The most important bit
to start with is really the changes to man-db.

The downside to not upgrading groff, as you note, is that you'll only be
able to actually use those codepoints which appear in the legacy
character set corresponding to the language you're using (e.g. German
manual pages will only be able to use Unicode codepoints that are
convertible to ISO-8859-1). This is annoying and I fully agree that it's
a bug, but it's not urgent, and I want to get over the first phase of
the transition before worrying about that.
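
As a rough illustration of that restriction (this is a sketch only, not
anything man-db or groff actually runs), the following Python lists the
characters in a page's text that have no equivalent in a given legacy
character set. The function name and the sample strings are made up for
the example.

```python
def unconvertible(text, legacy="iso-8859-1"):
    """Return the characters of text that the legacy charset cannot encode."""
    lost = []
    for ch in text:
        try:
            ch.encode(legacy)
        except UnicodeEncodeError:
            # No equivalent in the legacy charset, so a pipeline that
            # goes through that charset cannot carry this character.
            lost.append(ch)
    return lost


# Umlauts and sharp s exist in ISO-8859-1, so nothing is lost here.
print(unconvertible("Größe"))
# German-style quotation marks (U+201E, U+201C) are outside ISO-8859-1.
print(unconvertible("Größe „x“"))
```

In other words, a German page can use the umlauts freely, but anything
outside ISO-8859-1 is out of reach until groff itself is upgraded.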

> > (Except for this issue, I could display nicely French, English, Japanese
> > and Vietnamese UTF-8 manpages)
> 
> Cool, and what for Cyrillic, Arabic, Indic, etc?

Cyrillic works fine right now, and has done for years, via the ascii8
device in Debian groff. Arabic and Indic will probably still be a bit
screwed.

> > The CVS version introduced a -K option to specify the encoding
> > of the input file to groff. This should help to plan a transition for UTF-8
> > manpages by using this option in man-db.
> 
> Wouldn't it be easier and less prone to breakages if we simply hard-coded
> the encoding as UTF-8 and do all the processing in man-db?  A versioned
> dependency or conflict would be enough, and the code would be much simpler.
> 
> > Slowly moving files from man/ to man.UTF-8/ while still supporting the
> > legacy encoding in man/ would be a simple transition plan.
> 
> I'm afraid that's not an option.  So far I found 807 UTF-8 man pages, and
> only 21 of them were marked as such.  But fear not, it appears I've got a
> solution working, just let me download the rest of archive to check it.

Compatibility's the thing here. You're right that there are a lot of
pages in UTF-8 and not marked as such (there are 1308 or so in
manpages-es alone), but that's a relatively recent phenomenon.
Historically, and even up until a year or two ago, pages installed in
/usr/share/man/$LL/ had a fixed encoding which man-db could rely on
(basically ISO-8859-1 with a few exceptions which were handled specially
by man-db, the ones under the MULTIBYTE_GROFF define). Those that have
moved to UTF-8 without changing directory have clearly not been tested
on Debian since they don't work, and so I have no compunction about
codifying that breakage; but I won't break the pages that were installed
using the proper encoding and always worked to date. I'm also not really
keen on requiring everyone who installs a UTF-8 manual page to declare a
versioned conflict on man-db; that's a lot of arcs in the dependency
graph.

By contrast, moving to /usr/share/man/fr.UTF-8/ etc. for UTF-8 manual
pages is easy to describe and understand, and it doesn't break
compatibility. The worst case is that the manual page goes missing until
you upgrade man-db, but you don't get misencoded garbage, and upgrading
man-db first means that everything keeps working at least as well as
before. Once we have a better version of groff, we can just tweak
man-db's iconv pipeline handling and once again it will keep on working
at least as well as before.
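
To make the naming rule concrete, here is a minimal Python sketch of how
a directory name under /usr/share/man/ could be mapped to an encoding.
It is not man-db's implementation: the function name is hypothetical,
and the table of legacy exceptions is a pair of assumed entries for
illustration, not a copy of man-db's real list.

```python
# Default legacy encoding for an unsuffixed language directory.
LEGACY_DEFAULT = "ISO-8859-1"

# Assumed, illustrative exceptions only; not taken from man-db.
LEGACY_EXCEPTIONS = {"ja": "EUC-JP", "ru": "KOI8-R"}


def page_encoding(lang_dir):
    """Guess the encoding of pages installed under a directory name."""
    lang, sep, charset = lang_dir.partition(".")
    if sep:
        # An explicit suffix such as "fr.UTF-8" declares the encoding.
        return charset
    # No suffix: fall back to the fixed legacy encoding for the language.
    return LEGACY_EXCEPTIONS.get(lang, LEGACY_DEFAULT)


print(page_encoding("fr"))        # legacy directory
print(page_encoding("fr.UTF-8"))  # explicitly declared UTF-8
```

The point of the scheme is that pages in the unsuffixed directory keep
their old meaning, so nothing that worked before changes.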

> Due to Red Hat and probably other dists using UTF-8 already, plenty of man
> pages are in UTF-8 when our groff still can't parse them.  Having gone
> through 2/3 of the archive, I got 807 such pages so far.  And every single
> one displays lovely "ä" or similar instead.  That's 9% of all mans with
> non-ASCII characters in the corpus.

These are bugs in the packages in question, and it would be
straightforward for the maintainers to correct that even with current
man-db just by using 'iconv -c' in debian/rules, if they'd actually
tested those manual pages. I attempted to clarify that in the first half
of the policy bug report above.
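
For anyone who wants to see the failure and the fix side by side, here
is a small Python sketch. It only approximates what 'iconv -c' does
(dropping characters the target cannot represent), and the function
names and sample text are invented for the example.

```python
def misread(utf8_bytes, assumed="iso-8859-1"):
    """Decode UTF-8 bytes as if they were in the legacy charset."""
    return utf8_bytes.decode(assumed)


def recode(utf8_bytes, target="iso-8859-1"):
    """Convert UTF-8 bytes to the legacy charset.

    errors="ignore" silently drops characters with no equivalent in
    the target, which is roughly the effect of 'iconv -c'.
    """
    return utf8_bytes.decode("utf-8").encode(target, errors="ignore")


# A UTF-8 page read as ISO-8859-1 turns each umlaut into two characters.
print(misread("ä".encode("utf-8")))  # Ã¤

# Recoding first gives bytes that the legacy pipeline can handle; the
# euro sign has no ISO-8859-1 equivalent and is dropped.
print(recode("Größe €".encode("utf-8")))
```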


The good news is that man-db 2.5.0 is coming along very well, and I may
well have it complete before the initial policy amendment I proposed is
accepted. That's fine: in that event I'll supersede it with another
proposal and go straight to the transition plan, which allows people to
start using UTF-8 manual pages properly. I issued a pre-release version
recently. It's probably best to check it out using bzr:

  http://www.chiark.greenend.org.uk/~cjwatson/bzr/man-db/trunk/

There's nothing left on my to-do list for 2.5.0 but to wait a decent
length of time for translations and test it to death in the meantime.

Cheers,

-- 
Colin Watson                                       [cjwatson@debian.org]


