
Re: Man pages and UTF-8

On Sun, Aug 12, 2007 at 08:12:24PM +0900, Osamu Aoki wrote:
> On Sun, Aug 12, 2007 at 08:09:06PM +1000, Ben Finney wrote:
> > There's an important difference between "beat the program with a large
> > cluestick" and "beat the person with a large cluestick". Adam's
> > assertion was only that the former was necessary.
> Technically true.

I'm terribly sorry if my post could be read as insulting the _maintainer_. 
>>> "If man-db does this, it needs to be beaten with a large cluestick."
I believe this line was quite clear, but if it was not, please accept my
apologies.  The maintainer is fine, the state of man pages is not.

> > If the person with the necessary clue is the package maintainer, they
> > are more than welcome to issue the beating upon the software. This
> > isn't an arrogant or insulting statement, because it's the software
> > that is being declared clueless, not the person.
> But that still takes volunteer time and efforts for small benefit.

Please.  Don't say that avoiding the encoding hell is a "small benefit"
anywhere near someone from Poland or most other eastern European countries.
For example, most of the old (>10 years old) data I see is encoded in
Mazovia, then it's often win852, then win1250.  Compared to that mix, UTF-8
is a silver bullet.

The only real way to get rid of the encoding hell is to move everything into
a single common encoding.
> Even if it was UTF-8 encoded, if you do not have proper font installed,
> you get "TOFU"(white box) on your screen.  Please note man-db does
> $ LC_ALL=ja_JP.UTF-8 man man
> and displays correct text in English UTF-8 locale console if one has
> Japanese font installed.

If you can read Japanese, you're going to have a Japanese font installed.
If you don't, you won't really care.  Yet this is a large issue for those of
us who use Latin scripts with characters not included in someone's choice of
8-bit charset.  Even if you don't know the difference between "n" and "ń",
you should still get it rendered at least as an "n".  iconv is perfectly
capable of this feat (via //TRANSLIT), yet it needs to get non-mangled data
in the first place.
> (The source data is still in eucJP).

And in my opinion, that's exactly the source of the problems we are seeing.

> I was not comfortable since the poster did not even check the fact first
> and bashed current quality of the software.  The quality of software is
> closely related to and inseparable from its upstream and its maintainer,
> i.e., Colin in my eyes.

The software doesn't support Debian's official charset; that's the problem.
Support for ancient encodings is optional, support for Unicode is not.  So
the current quality of the software is bad.

Is this the maintainer's fault?  Not really -- this is a large task, and it
needs to see more effort thrown at it.

So, let me start.

Current state

The man pages carry no encoding markers inside; the encoding is currently
derived partly from the language and partly from the user's charset.  Unless
the two happen to match, manpages will be mangled unless they're written
solely in 7-bit ASCII.
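A minimal illustration of why this guessing is fragile (assumes GNU iconv;
`\361` is the byte 0xF1 in octal):

```shell
# The single byte 0xF1 is "ń" in ISO-8859-2 (Polish) but "ñ" in
# ISO-8859-1 -- so deriving the charset from the language or from the
# user's locale produces the wrong letter whenever the two disagree.
printf '\361' | iconv -f ISO-8859-2 -t UTF-8   # prints ń
printf '\361' | iconv -f ISO-8859-1 -t UTF-8   # prints ñ
```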

Existing manpages

Among the manpages installed on the desktop box I'm writing these words on,
I've got:
4511 purely 7-bit files
778 encoded in legacy charsets
23 wrongly (prematurely) encoded in UTF-8
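A rough sketch of how such a count can be reproduced (the function name is
mine; `LC_ALL=C` makes grep operate on raw bytes, and the iconv round-trip
succeeds only for well-formed UTF-8):

```shell
# classify_encoding FILE -> prints "ascii", "utf8", or "legacy"
classify_encoding() {
    if LC_ALL=C grep -q '[^ -~[:space:]]' "$1"; then
        # Non-ASCII bytes present: well-formed UTF-8, or a legacy charset?
        if iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
            echo utf8
        else
            echo legacy
        fi
    else
        echo ascii
    fi
}
```

Looping this over the troff sources under /usr/share/man (decompressing
first) yields counts like the ones above.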

Of course, those 23 are rendered similar to:
	"Michel DÃ¤nzer" instead of "Michel Dänzer" in radeon(4)
	"â’in 384u-216u" instead of "…" in piuparts(1)
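This kind of mangling is just UTF-8 bytes being reinterpreted as an 8-bit
charset, and it's easy to reproduce (a sketch assuming GNU iconv):

```shell
# "ä" in UTF-8 is the two bytes 0xC3 0xA4; feed them to something that
# believes the text is ISO-8859-1 and each byte becomes a separate
# character, "Ã¤" -- which is exactly where "DÃ¤nzer" comes from.
printf 'D\303\244nzer\n' | iconv -f ISO-8859-1 -t UTF-8   # prints DÃ¤nzer
```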

Issues to fix

A. man output
B. groff processing
C. man input

Fixes for A. and B. are mostly local to "man-db", fixing C. would be a
Debian-wide issue.

My proposal for A. and B.

Unless someone comes up with a better idea, let's completely drop all
support for non-UTF-8 locales (but read on) throughout the whole pipeline.
That would eliminate the current complexity, leaving only one charset to
maintain.  Only at the very last stage would the output be passed through
iconv //TRANSLIT (_not_ iconv -c).
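A sketch of what that last stage could look like (the sample text is
illustrative; `locale charmap` reports the charset of the user's locale):

```shell
# Everything upstream stays UTF-8; only the final rendered text is
# recoded to whatever the user's terminal expects.  //TRANSLIT keeps
# unrepresentable characters readable instead of dropping them.
target=$(locale charmap)        # e.g. UTF-8, ISO-8859-2, ANSI_X3.4-1968
printf 'Michel D\303\244nzer, Gda\305\204ski\n' \
    | iconv -f UTF-8 -t "$target//TRANSLIT"
```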

The difference between //TRANSLIT and -c is that the latter will blackhole
characters, while the former tries to find substitutes within the target
charset before resorting to a question mark.

German text: "Michel Dänzer" will yield "Michel Danzer"
Polish text: "Gdański" -> "Gdanski"
Japanese text: "????? ??? ??? ????" -- at least you know something's there
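Concretely (the exact substitutions //TRANSLIT picks can vary with the
glibc locale tables):

```shell
# -c blackholes anything the target charset lacks:
printf 'Gda\305\204ski\n' | iconv -c -f UTF-8 -t ASCII   # prints Gdaski
# //TRANSLIT substitutes a lookalike, or "?" as a last resort:
printf 'Gda\305\204ski\n' | iconv -f UTF-8 -t ASCII//TRANSLIT
```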

Source files

As there are no markers inside the troff files, we can resort to a hack:
check whether the input is well-formed UTF-8.  In theory, a file in a
legacy charset can occasionally be misrecognized as UTF-8 this way.  Yet I
ran a check on the 1059 packages installed here, and there wasn't a single
false positive.  Thus, it's a way to get proper UTF-8 support as of
tomorrow.
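The validity check itself is a one-liner; a sketch of the hack (any strict
UTF-8 decoder works equally well, and the file names here are illustrative):

```shell
# iconv fails on the first malformed sequence, so a UTF-8 -> UTF-8
# round-trip is a cheap well-formedness test.
is_utf8() { iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; }

printf 'Gda\305\204ski\n' > /tmp/page-utf8     # valid UTF-8
printf 'Gda\361ski\n'     > /tmp/page-latin2   # ISO-8859-2 bytes
is_utf8 /tmp/page-utf8   && echo "page-utf8: treat as UTF-8"
is_utf8 /tmp/page-latin2 || echo "page-latin2: fall back to legacy table"
rm -f /tmp/page-utf8 /tmp/page-latin2
```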

My proposal for C.

1. Let's check the whole archive for false positives.  Only if any
   problematic packages were found would there be any need for a transition.
2. Change "man" so it reads the troff file, checks whether it's valid UTF-8,
   and if it is not, falls back to the current table of hard-coded 8-bit
   charsets.
3. Hurray!  You can upload UTF-8 manpages now.
4. Lintian checks, policy changes, etc...
5. (Lenny+1?  Lenny+2?)  The hack and support for ancient manpages can be
   dropped.

How does this sound?
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
