Re: Man pages and UTF-8

To: debian-mentors@lists.debian.org
Subject: Re: Man pages and UTF-8
From: Adam Borowski <kilobyte@angband.pl>
Date: Wed, 15 Aug 2007 01:43:43 +0200
Message-id: <20070814234343.GB19877@angband.pl>
In-reply-to: <87mywtr9pu.fsf@windlord.stanford.edu>
References: <46BC35D3.2000302@cowlark.com> <20070814152527.GA10327@nekral.homelinux.net> <20070814225053.GA13437@angband.pl> <87mywtr9pu.fsf@windlord.stanford.edu>

On Tue, Aug 14, 2007 at 04:13:17PM -0700, Russ Allbery wrote:
> Adam Borowski <kilobyte@angband.pl> writes:
> 
> > Any such description file would work only as long as you hard-code any
> > fonts, and somehow provide them for any potential reader.  Without this,
> > wcwidth() is as good as you can get for fixed-width fonts.  For
> > comparison, Red Hat makes a wild assumption that everything u0800..uFFFF
> > is doublewide.
> 
> The correct thing to do is to use the information from the latest version
> of the Asian character width property table:
> 
>     http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt
>
> u0800..uFFFF is a bad approximation that misses several ranges and is
> actually wrong for most of the range up to u1100.

Except that the few scripts supported by old groff (Japanese, Chinese,
Latin1, ...) have no characters in that range, so nothing there is handled
wrongly.  It becomes a bug only once we start to support Indic scripts and
so on -- which we certainly want.

> For another application, I use the approximation of:
>
> our @WIDE = qw(\x{2E80}-\x{303E} \x{3041}-\x{33FF} \x{4E00}-\x{9FBB}
>                \x{AC00}-\x{D7A3} \x{FF01}-\x{FF60});

Heh.  Similar here, I used

#define isw2width(x) ((x)>=0x1100  && ((x)<=0x11ff ||   \
                      (x)>=0x2e80) && ((x)<=0xd7ff ||   \
                      (x)>=0xf900) && ((x)<=0xfaff ||   \
                      (x)>=0xfe30) && ((x)<=0xfe6f ||   \
                      (x)>=0xff01) && ((x)<=0xff60 ||   \
                      (x)>=0xffe0) && ((x)<=0xffe6 ||   \
                      (x)>=0x20000) && (x)<=0x2ffff)

for portability, so as not to depend on the system's wcwidth() (an XSI
extension that not every libc provides), before I just copied in wcwidth.c
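
The macro is easier to check when its seven inclusive ranges are written
out as a table: each "||" in it closes one range and opens the next.  A
minimal sketch of the same test, with names (wide_ranges, is_wide_approx)
that are only illustrative and not taken from any existing code:

```c
#include <stddef.h>
#include <stdint.h>

/* The seven inclusive ranges encoded by the isw2width() macro above. */
static const struct { uint32_t first, last; } wide_ranges[] = {
    { 0x1100,  0x11ff  },
    { 0x2e80,  0xd7ff  },
    { 0xf900,  0xfaff  },
    { 0xfe30,  0xfe6f  },
    { 0xff01,  0xff60  },
    { 0xffe0,  0xffe6  },
    { 0x20000, 0x2ffff },
};

/* Returns 1 if x lies in one of the ranges above, 0 otherwise.
 * (Hypothetical helper; a linear scan is enough for seven entries.) */
static int is_wide_approx(uint32_t x)
{
    size_t i;

    for (i = 0; i < sizeof wide_ranges / sizeof wide_ranges[0]; i++)
        if (x >= wide_ranges[i].first && x <= wide_ranges[i].last)
            return 1;
    return 0;
}
```

For example, is_wide_approx(0x4e00) is 1 (a CJK ideograph), while
is_wide_approx(0x0915) is 0 (a Devanagari letter).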

> 
> but even that is not a particularly good approximation compared to using
> the real table.
> 
> My guess is that wcwidth's answer is based on the latest version of that
> table at the time that glibc released, although I'd have to double-check
> to be sure.

Yes: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
It has two different answers, based on the new and historic handling of
CJK Ambiguous characters.  The new handling actually boils down to:

    (ucs >= 0x1100 &&
     (ucs <= 0x115f ||                    /* Hangul Jamo init. consonants */
      ucs == 0x2329 || ucs == 0x232a ||
      (ucs >= 0x2e80 && ucs <= 0xa4cf &&
       ucs != 0x303f) ||                  /* CJK ... Yi */
      (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
      (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
      (ucs >= 0xfe10 && ucs <= 0xfe19) || /* Vertical forms */
      (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
      (ucs >= 0xff00 && ucs <= 0xff60) || /* Fullwidth Forms */
      (ucs >= 0xffe0 && ucs <= 0xffe6) ||
      (ucs >= 0x20000 && ucs <= 0x2fffd) ||
      (ucs >= 0x30000 && ucs <= 0x3fffd)));

so our approximations were not far off.  Unlike them, though, wcwidth()
also returns 0 for combining characters -- something important if support
for Indic is a goal.
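
To make the difference concrete, here is a rough sketch that counts
terminal cells three ways.  It reuses the expression quoted above for the
wide test; the combining test covers only U+0300..U+036F and U+094D, a
tiny subset I picked for illustration (the real wcwidth.c uses a full
table), and control characters are ignored.  All function names are made
up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wide test: the expression quoted above from wcwidth.c. */
static int is_wide(uint32_t ucs)
{
    return (ucs >= 0x1100 &&
            (ucs <= 0x115f ||
             ucs == 0x2329 || ucs == 0x232a ||
             (ucs >= 0x2e80 && ucs <= 0xa4cf && ucs != 0x303f) ||
             (ucs >= 0xac00 && ucs <= 0xd7a3) ||
             (ucs >= 0xf900 && ucs <= 0xfaff) ||
             (ucs >= 0xfe10 && ucs <= 0xfe19) ||
             (ucs >= 0xfe30 && ucs <= 0xfe6f) ||
             (ucs >= 0xff00 && ucs <= 0xff60) ||
             (ucs >= 0xffe0 && ucs <= 0xffe6) ||
             (ucs >= 0x20000 && ucs <= 0x2fffd) ||
             (ucs >= 0x30000 && ucs <= 0x3fffd)));
}

/* ASSUMPTION: illustrative subset only (Combining Diacritical Marks
 * plus the Devanagari virama), not the full combining table. */
static int is_combining_subset(uint32_t ucs)
{
    return (ucs >= 0x0300 && ucs <= 0x036f) || ucs == 0x094d;
}

/* 0 for the combining subset, 2 for wide, 1 for everything else. */
static int cell_width(uint32_t ucs)
{
    if (is_combining_subset(ucs))
        return 0;
    return is_wide(ucs) ? 2 : 1;
}

/* The "everything u0800..uFFFF is doublewide" assumption. */
static int cell_width_0800(uint32_t ucs)
{
    return (ucs >= 0x0800 && ucs <= 0xffff) ? 2 : 1;
}

/* Sums the per-character widths of n code points using `width`. */
static int string_width(const uint32_t *s, size_t n,
                        int (*width)(uint32_t))
{
    int total = 0;
    size_t i;

    assert(n == 0 || s != NULL);
    for (i = 0; i < n; i++)
        total += width(s[i]);
    return total;
}
```

For the Devanagari sequence U+0915 U+094D U+0915, string_width() gives 2
with cell_width and 6 with cell_width_0800, which is the kind of error
that shows up once Indic scripts are supported.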

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.

References:
- Man pages and UTF-8
  - From: David Given <dg@cowlark.com>
- Re: Man pages and UTF-8
  - From: Nicolas François <nicolas.francois@centraliens.net>
- Re: Man pages and UTF-8
  - From: Adam Borowski <kilobyte@angband.pl>
- Re: Man pages and UTF-8
  - From: Russ Allbery <rra@debian.org>
