
Re: UTF-8 manual pages



On Fri, Oct 12, 2007 at 08:51:31AM +0900, Junichi Uekawa wrote:
> Hi,
> >         o Per-locale directory handling has been improved. Directories such
> >           as "fr.UTF-8" may be used for occasions when it is appropriate to
> >           specify the character set but not the country, and so a full
> >           locale name is inconvenient.
> > 
> >         o There is a new "manconv" program which can try multiple possible
> >           encodings for a file, thus allowing UTF-8 manual pages to be
> >           installed in any directory even without an explicit encoding
> >           declaration.
> 
> This is cool.
> 
> A great workaround for that compatibility mess RedHat has created for us.
> 
> I assume UTF-8 / local-encoding detection can fail sometimes; which
> encoding has precedence?

You're right, it can. It's much more likely that a random non-UTF-8
document will fail to decode as UTF-8 than the other way round, so man
tries UTF-8 first and that will take precedence.
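
To illustrate the ordering argument, here is a minimal Python sketch of
"try each candidate encoding in order". It is not manconv's actual code:
the function name decode_page and the candidate list are assumptions made
up for this example, and it decodes the whole input at once rather than
working through the file in pieces.

```python
def decode_page(data: bytes, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration only; not part of man-db.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate rejected the bytes; fall through to the next one.
            continue
    raise ValueError("no candidate encoding could decode the data")


# A lone 0xE9 byte is "é" in ISO-8859-1 but is not valid UTF-8,
# so the UTF-8 attempt fails and the fallback is used.
text, used = decode_page(b"Ren\xe9")
```

The order matters because ISO-8859-1 assigns a character to every byte
value, so it never fails to decode; if it were tried first, UTF-8 would
never get a chance.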

I did just notice a bug in manconv's detection, which I've fixed for
2.5.1. With that bug fixed, the only case in which a page should be
decoded incorrectly is when it is not valid UTF-8 but contains some text
in the first 64KB that looks like valid UTF-8. I don't know of an example
of this happening in practice. The only hard case you get in practice is
a very large mostly-ASCII page with some ISO-8859-1 near the end (maybe
in an author's name), and manconv handles that fine.

However, if there is still ambiguity due to this, you can either install
the page in a directory whose name is explicitly tagged with an encoding
(another reason I'd like to do that by default, as otherwise a few pages
end up being put there anyway just to disambiguate) or use a "coding:"
declaration in the file. This is documented in manconv(1).

Cheers,

-- 
Colin Watson                                       [cjwatson@debian.org]


