
Re: Man pages and UTF-8


On Fri, Aug 10, 2007 at 01:23:02PM +0200, Adam Borowski wrote:
> On Fri, Aug 10, 2007 at 11:24:08AM +0100, David Given wrote:
> > Ben Finney wrote:
> > [...]
> > > That sounds like a bug. I was under the impression that the default
> > > encoding of everything in lenny was supposed to be UTF-8.

I wish.  A lot of documentation is still in old encodings...

> > > What tool is it that has this different default encoding?
> > 
> > Well, I tried UTF-8 with the assumption that it would work, and it threw up a
> I would call this a bug, in Etch it was "only" "important".
> ANY file on a modern system installed by the distribution (and not in the
> user's private data, /mnt/win/ or an upstream source tarball) is bad for a
> number of reasons, mangling people's surnames being one of less important
> ones.
> All data files should be in UTF-8 (or UCS4, or any other format which does
> not include data loss).  If an user then chooses to use a broken charset due
> to his/her historic preferences, so be it -- but you cannot inflict data
> loss on others.  If man-db does this, it needs to be beaten with a large
> cluestick.

I think the maintainer of man-db is well aware of this and has more than
enough "clue".  (A statement like the one above, made without checking
the facts, is nothing but arrogance and should be avoided if you want to
be a good Debian volunteer.)

If you have the time and skill, please submit a patch and an exact
transition plan to the BTS.  To me, it looks like Colin is getting the
tools ready.

As I see in the changelog ...

> Thu Aug 10 17:23:03 BST 2006  Colin Watson  <cjwatson@debian.org>
>         * src/encodings.c (get_default_device): Always use utf8 if preconv
>           is available.
>           (get_roff_encoding): Skip CJK UTF-8 hack if preconv is available.
>         * src/man.c (make_roff_command): Use preconv if available to recode
>           input even if the encoding is detected by means other than looking
>           at the preprocessor line. Skip iconv preprocessing in that case.

The current text data may not be in UTF-8, but the tool internally runs
on UTF-8 data.  (I did not check the source any further than the above.
I vaguely remember that Colin posted something about a UTF-8 transition
plan before.)
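
To illustrate the general idea of recoding legacy input to UTF-8 before
further processing, here is a minimal sketch in Python.  It is not
man-db's or preconv's actual code: the function name recode_to_utf8 and
the ISO-8859-1 default are assumptions made up for this example.

```python
def recode_to_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes.

    Hypothetical helper: if raw is already valid UTF-8 it is returned
    unchanged, otherwise it is decoded with legacy_encoding (an assumed
    fallback) and re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")          # validation only; result is discarded
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


# A page fragment stored in ISO-8859-1: byte 0xC9 is E with acute accent.
legacy_page = b".TH CAF\xc9 1\n"
utf8_page = recode_to_utf8(legacy_page)

# Recoding data that is already UTF-8 leaves it untouched.
assert recode_to_utf8(utf8_page) == utf8_page
```

Guessing by "valid UTF-8 or not" is only a heuristic.  A real tool would
more likely rely on a declared encoding, as the changelog entry about the
preprocessor line suggests.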



PS: Please be reminded that even UTF-8 encoded text, which can only
represent UCS code points, is not without "data loss".  The assignment of
UCS codes to glyphs was a practical compromise: the same code was
assigned to several glyphs sharing some history.  (This is mostly a
Chinese-Japanese-Korean issue, since those scripts have a huge number of
glyphs.)
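
A small sketch of what this means in practice, using Python's standard
unicodedata module.  The choice of U+76F4 is mine; it is a commonly cited
character whose usual printed form differs between Chinese and Japanese
fonts.  The helper name describe is hypothetical.

```python
import unicodedata


def describe(ch: str) -> tuple[str, bytes]:
    """Return the Unicode character name and the UTF-8 bytes of ch."""
    return unicodedata.name(ch), ch.encode("utf-8")


# One unified code point, whichever regional glyph the writer had in mind.
name, encoded = describe("\u76f4")
print(name)
print(encoded)
```

The encoded text has a single name and a single byte sequence for this
character.  Which glyph variant the author intended has to be carried
outside the text itself, for example by font selection or language
tagging.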
