Re: Bug#99933: Comments on Unicode

To: "Raul Miller" <moth@debian.org>, <debian-i18n@lists.debian.org>
Cc: <99933@bugs.debian.org>
Subject: Re: Bug#99933: Comments on Unicode
From: "David Starner" <dstarner98@aasaa.ofe.org>
Date: Sun, 8 Jul 2001 21:05:40 +0100
Message-id: <000e01c107e9$7d7e7b20$ae4efea9@dvdeug>
References: <010201c10517$164398c0$ae4efea9@dvdeug> <20010705133736.C12776@usatoday.com> <20010706112341.Q2483@kukkaruukku.keltti.jyu.fi> <010201c10517$164398c0$ae4efea9@dvdeug> <20010705133736.C12776@usatoday.com> <02a801c105cc$d6343ee0$ae4efea9@dvdeug> <994422263.69160713@debian.org>

----- Original Message -----
From: Raul Miller <moth@debian.org>
Subject: Re: Bug#99933: Comments on Unicode

> On Fri, Jul 06, 2001 at 04:36:25AM +0100, David Starner wrote:
> > > Once unicode can act as a super set for every character set we
> > > currently support, we can use it as such.  Until then, we can't.
> >
> > If Unicode were a super set for every character set that anyone needs to
> > support, it would be worthless and completely unusable.
>
> I didn't say for any character set that anyone needs to support.
> I said for every character set we currently support.  I hope you see the
> difference.

With my Debian hat on, of course I see the difference. With my Unicode hat
on, there is no difference. Every small group and company has their own
character sets that they need supported, and Debian's just another group.
Note that Unix locales tend to preferentially use standardized character sets
(JIS X 0208, ISO-8859-*), which ISO 10646 had to superset completely.

If you have a recent version of locales installed, look in
/usr/share/i18n/charmaps, which has every character set we support for use
in iconv or locales. For actual locale charsets, look in /etc/locale.gen. If
you remove ISO-8859-* (which are all Unicode compatible) and remove UTF-8,
you're left with 11 charsets: cp1251, tis-620, koi8-r, koi8-u, euc-tw,
euc-jp, gb2312, gb18030, gbk, big5, and big5-hkscs. Three of these have
problems: euc-tw, big5 and big5-hkscs. All three have characters that can't
be reversibly mapped to Unicode and back. euc-tw shouldn't be a problem, as
its irreversible mappings are due to duplication of an entire CNS plane of
characters, apparently due to an encoding quirk. big5 has some characters
mapped to private use segments; I don't know if this is because glibc
doesn't use Unicode 3.1 yet, or if that represents a private use segment in
big5 (the characters are contiguous), or if they haven't been encoded in
Unicode yet (unlikely, IMO).
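
A rough way to look for such irreversible mappings, sketched in Python:
decode each candidate byte sequence to Unicode, encode it back, and flag
any that do not return the original bytes. This uses the codec tables
bundled with Python, not the glibc charmaps discussed above, so the results
may well differ from what iconv does. The function name irreversible_bytes
is made up for illustration, and only one- and two-byte sequences are
tried, so longer forms (such as the four-byte sequences in gb18030) are not
covered.

```python
def irreversible_bytes(codec):
    """Return byte sequences that decode to a single code point under
    `codec` but do not encode back to the same bytes."""
    # Candidates: every single byte, then every two-byte sequence with a
    # high lead byte (the usual shape of the EUC and Big5 families).
    candidates = [bytes([b]) for b in range(0x100)]
    candidates += [bytes([hi, lo])
                   for hi in range(0x80, 0x100)
                   for lo in range(0x100)]
    bad = []
    for raw in candidates:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # not a valid sequence in this charset
        if len(text) != 1:
            continue  # more than one code point; skipped in this sketch
        try:
            back = text.encode(codec)
        except UnicodeEncodeError:
            back = None  # decodes but cannot be encoded again
        if back != raw:
            bad.append(raw)
    return bad


if __name__ == "__main__":
    # Codec names here are Python's, assumed to correspond roughly to the
    # locale charsets named above.
    for name in ("koi8_r", "cp1251", "euc_jp", "big5", "big5hkscs"):
        print(name, len(irreversible_bytes(name)))
```

A charset that maps cleanly onto Unicode should give an empty list; any
sequences reported are candidates for the kind of one-way mapping described
above, to be checked by hand against the actual charmap.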

--
David Starner - dstarner98@aasaa.ofe.org, dvdeug@debian.org
