Re: UTF-8 manual pages
On Fri, Oct 12, 2007 at 08:51:31AM +0900, Junichi Uekawa wrote:
> Hi,
> > o Per-locale directory handling has been improved. Directories such
> > as "fr.UTF-8" may be used for occasions when it is appropriate to
> > specify the character set but not the country, and so a full
> > locale name is inconvenient.
> >
> > o There is a new "manconv" program which can try multiple possible
> > encodings for a file, thus allowing UTF-8 manual pages to be
> > installed in any directory even without an explicit encoding
> > declaration.
>
> This is cool.
>
> A great workaround for that compatibility mess Red Hat has created for us.
>
> I assume UTF-8 / local-encoding detection can fail sometimes; which
> encoding has precedence?

You're right, it can. It's much more likely that a random non-UTF-8
document will fail to decode as UTF-8 than the other way round, so man
tries UTF-8 first and that will take precedence.

I did just notice a bug in manconv's detection which I've fixed for
2.5.1. With that bug fixed, the only circumstances in which a page will
be decoded incorrectly should be if it is not valid UTF-8 but contains
some text which looks like valid UTF-8 in the first 64KB. I don't know
of an example of this happening in practice. The only hard case you get
in practice is a very large mostly-ASCII page with some ISO-8859-1 near
the end (maybe in an author's name), and manconv handles that fine.
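
The "try UTF-8 first, then fall back" ordering described above can be
illustrated with a rough Python sketch. This is not manconv's actual
code; the function name guess_decode and the candidate list are
assumptions made up for the example, and it decodes the whole input at
once, so it does not model the 64KB window mentioned above.

```python
def guess_decode(data, candidates=("UTF-8", "ISO-8859-1")):
    """Try each candidate encoding in order and return (text, encoding).

    Hypothetical helper for illustration only. UTF-8 goes first because
    arbitrary non-UTF-8 bytes rarely form valid UTF-8, whereas
    ISO-8859-1 accepts any byte sequence and so can never "fail".
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the input")


# A UTF-8 page decodes on the first attempt.
print(guess_decode("caf\u00e9".encode("utf-8")))

# The same word in ISO-8859-1 is not valid UTF-8, so the fallback is used.
print(guess_decode("caf\u00e9".encode("iso-8859-1")))

# The ambiguous case: bytes that were meant as ISO-8859-1 but happen to
# be valid UTF-8 are decoded as UTF-8, because UTF-8 takes precedence.
print(guess_decode(b"\xc3\xa9"))
```

The last call shows the residual ambiguity: nothing in the bytes
themselves says which reading was intended.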
However, if there is still ambiguity due to this, you can either install
the page in a directory name that's explicitly tagged with an encoding
(another reason I'd like to do that by default, as otherwise we get a
few pages that are put there anyway to disambiguate) or use a coding:
declaration in the file. This is documented in manconv(1).
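
For illustration, here is a rough Python sketch of reading such a
declaration. It assumes the declaration sits on the first line of the
page in a form like '\" -*- coding: UTF-8 -*- (optionally with
preprocessor letters before the -*-); check manconv(1) for the
authoritative syntax. The function name declared_encoding is made up
for the example and this is not how manconv itself is implemented.

```python
import re

# Assumed declaration shape: -*- coding: NAME -*- inside a roff comment.
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9._-]+).*?-\*-")


def declared_encoding(data):
    """Return the encoding named on the first line, or None if absent.

    Hypothetical helper for illustration only.
    """
    first_line = data.split(b"\n", 1)[0]
    # Only look at lines that start with the roff comment marker '\"
    if not first_line.startswith(b"'\\\""):
        return None
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None


page = b"'\\\" -*- coding: ISO-8859-1 -*-\n.TH FOO 1\n"
print(declared_encoding(page))
print(declared_encoding(b".TH FOO 1\n"))
```

An explicit declaration like this removes the guesswork for a single
page, while an encoding-tagged directory such as "fr.UTF-8" does the
same for everything installed under it.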
Cheers,
--
Colin Watson [cjwatson@debian.org]