Re: UTF-8 manual pages

To: debian-devel@lists.debian.org
Subject: Re: UTF-8 manual pages
From: Adam Borowski <kilobyte@angband.pl>
Date: Fri, 12 Oct 2007 15:21:48 +0200
Message-id: <20071012132148.GA30157@angband.pl>
In-reply-to: <20071012110356.GB13982@riva.ucam.org>
References: <20071011112218.GC1666@riva.ucam.org> <87ejg11a7w.dancerj%dancer@netfort.gr.jp> <20071012110356.GB13982@riva.ucam.org>

On Fri, Oct 12, 2007 at 12:03:56PM +0100, Colin Watson wrote:
> On Fri, Oct 12, 2007 at 08:51:31AM +0900, Junichi Uekawa wrote:
> > I assume UTF-8 / local-encoding detection can fail sometimes; which
> > encoding has precedence?
> 
> You're right, it can. It's much more likely that a random non-UTF-8
> document will fail to decode as UTF-8 than the other way round, so man
> tries UTF-8 first and that will take precedence.
> 
> I did just notice a bug in manconv's detection which I've fixed for
> 2.5.1. With that bug fixed, the only circumstances in which a page will
> be decoded incorrectly should be if it is not valid UTF-8 but contains
> some text which looks like valid UTF-8 in the first 64KB. I don't know
> of an example of this happening in practice. The only hard case you get
> in practice is a very large mostly-ASCII page with some ISO-8859-1 near
> the end (maybe in an author's name), and manconv handles that fine.

I went through the whole archive a couple of months ago (except for packages
with no binary for i386), and there's not a single false positive if you
read the whole file. Reading only the first 64KB gives the wrong answer in
three cases, but Colin handled those another way.
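
To make the whole-file vs. first-64KB difference concrete, here is a rough
Python sketch of "try UTF-8 first, fall back to a legacy encoding". It is not
manconv's actual code; the function names, the iso-8859-1 fallback and the
`limit` parameter are assumptions picked for illustration only.

```python
import codecs


def looks_like_utf8(data: bytes, limit=None) -> bool:
    """Return True if `data` (or only its first `limit` bytes) is valid UTF-8."""
    truncated = limit is not None and len(data) > limit
    probe = data[:limit] if truncated else data
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # When the probe is cut short, a multi-byte sequence split at the cut
        # must not count as an error, hence final=False in that case.
        decoder.decode(probe, final=not truncated)
    except UnicodeDecodeError:
        return False
    return True


def decode_page(data: bytes, fallback: str = "iso-8859-1"):
    """Decode as UTF-8 if the whole page is valid UTF-8, else use `fallback`."""
    if looks_like_utf8(data):
        return data.decode("utf-8"), "utf-8"
    return data.decode(fallback), fallback


# A large, mostly-ASCII page with one ISO-8859-1 byte near the end
# (say, in an author's name).
page = b"x" * 70000 + b"Ren\xe9\n"

prefix_says_utf8 = looks_like_utf8(page, limit=64 * 1024)
whole_says_utf8 = looks_like_utf8(page)
text, used = decode_page(page)
```

Here the 64KB prefix is pure ASCII, so a prefix-only check reports UTF-8,
while the whole-file check sees the stray byte and picks the fallback.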

It is possible but very unlikely that the UTF-8 / locale-encoding detection
will fail in the future, but there are two workarounds:
1) explicitly specify the encoding
2) use UTF-8

I would like to point you to the second solution.  In fact, it's long
overdue, and I really think that debian/changelog and debian/control should
be joined by /usr/share/man/ and /usr/share/doc/ in the mandatory Unicode
club.  Preferably as of right now.

Having more than one encoding in the wild = guaranteed lossage.

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
