Re: Man pages and UTF-8

To: Adam Borowski <kilobyte@angband.pl>
Cc: debian-mentors@lists.debian.org
Subject: Re: Man pages and UTF-8
From: Osamu Aoki <osamu@debian.org>
Date: Sun, 12 Aug 2007 10:50:34 +0900
Message-id: <20070812015034.GA9322@debian.org>
Mail-followup-to: Adam Borowski <kilobyte@angband.pl>, debian-mentors@lists.debian.org
In-reply-to: <20070810112302.GA26779@angband.pl>
References: <46BC35D3.2000302@cowlark.com> <87mywz4s5c.fsf@benfinney.id.au> <46BC3CC8.1000004@cowlark.com> <20070810112302.GA26779@angband.pl>

Hi,

On Fri, Aug 10, 2007 at 01:23:02PM +0200, Adam Borowski wrote:
> On Fri, Aug 10, 2007 at 11:24:08AM +0100, David Given wrote:
> > Ben Finney wrote:
> > [...]
> > > That sounds like a bug. I was under the impression that the default
> > > encoding of everything in lenny was supposed to be UTF-8.

I wish.  Much of the documentation is still in old encodings...

> > > What tool is it that has this different default encoding?
> > 
> > Well, I tried UTF-8 with the assumption that it would work, and it threw up a
...
> I would call this a bug, in Etch it was "only" "important".
> ANY file on a modern system installed by the distribution (and not in the
> user's private data, /mnt/win/ or an upstream source tarball) is bad for a
> number of reasons, mangling people's surnames being one of less important
> ones.
> 
> All data files should be in UTF-8 (or UCS4, or any other format which does
> not include data loss).  If an user then chooses to use a broken charset due
> to his/her historic preferences, so be it -- but you cannot inflict data
> loss on others.  If man-db does this, it needs to be beaten with a large
> cluestick.

I think the maintainer of man-db is well aware of this and has more than
enough "clue".  (A statement like the one above, made without checking
the facts, is nothing but arrogance; a good Debian volunteer should
avoid it.)

If you have the time and skill, please submit a patch and an exact
transition plan to the BTS.  To me, it looks like Colin is already
getting the tools ready.

As I see in the changelog ...

> Thu Aug 10 17:23:03 BST 2006  Colin Watson  <cjwatson@debian.org>
> 
>         * src/encodings.c (get_default_device): Always use utf8 if preconv
>           is available.
>           (get_roff_encoding): Skip CJK UTF-8 hack if preconv is available.
>         * src/man.c (make_roff_command): Use preconv if available to recode
>           input even if the encoding is detected by means other than looking
>           at the preprocessor line. Skip iconv preprocessing in that case.

The existing text data may still be in non-UTF-8 encodings, but the tool
internally runs on UTF-8 data.  (I did not check the source any further
than the above.  I vaguely remember that Colin posted something about a
UTF-8 transition plan before.)
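
To illustrate the general idea only: the sketch below recodes text from
a legacy encoding to UTF-8 before further processing.  It is not man-db
or preconv code; the function name recode_to_utf8 and the sample bytes
are made up for this example, and the source encoding is assumed to be
known in advance.

```python
def recode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode as UTF-8."""
    # Hypothetical helper: the caller must already know the source encoding.
    return raw.decode(source_encoding).encode("utf-8")


# Sample input: "café" stored as ISO-8859-1 (one byte per character).
legacy = b"caf\xe9"
utf8 = recode_to_utf8(legacy, "iso-8859-1")
print(utf8.decode("utf-8"))
```

After this step every later stage can assume a single input encoding,
whatever encoding the file on disk uses.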

Thanks.

Osamu

PS: Please be reminded that even UTF-8 encoded text, which can only
represent UCS code points, is not without "data loss".  The assignment
of UCS code points to glyphs was a practical compromise: the same code
point was assigned to several glyphs that share some history.  (This is
mostly a Chinese-Japanese-Korean issue, since those scripts have a huge
number of glyphs.)
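
A small sketch of what this means in practice, using only Python's
standard codecs.  The character U+76F4 is chosen here as an example of
an ideograph present in both a Japanese and a Chinese legacy character
set; the variable names are arbitrary.

```python
import unicodedata

# One ideograph that exists in both JIS X 0208 and GB2312.
ch = "\u76f4"

# Encode it with a Japanese and a Chinese legacy encoding.
japanese_bytes = ch.encode("euc_jp")
chinese_bytes = ch.encode("gb2312")

# Decoding either byte sequence yields the same single code point, so
# UTF-8 text alone no longer says which regional glyph was meant.
from_japanese = japanese_bytes.decode("euc_jp")
from_chinese = chinese_bytes.decode("gb2312")

print(unicodedata.name(from_japanese))
print(from_japanese == from_chinese)
```

Which glyph shape gets displayed is then decided by the font or by
language information kept outside the text itself.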

Attachment: signature.asc
Description: Digital signature
