Re: UTF-8 locales

To: debian-i18n@lists.debian.org
Cc: debian-devel@lists.debian.org
Subject: Re: UTF-8 locales
From: Tomohiro KUBOTA <tkubota@riken.go.jp>
Date: Mon, 20 Nov 2000 19:25:11 +0900
Message-id: <87y9yfdjp4.wl@surfchem0.riken.go.jp>
In-reply-to: In your message of "Mon, 20 Nov 2000 01:11:02 -0700" <20001120011102.B31142@lovelife.olvc.ab.ca>
References: <87r94gqd2e.wl@surfchem0.riken.go.jp> <200011131854.DAA16802@smtp5.dti.ne.jp> <87u29928rd.wl@surfchem0.riken.go.jp> <20001116004510.A3138@debian.org> <20001116094026.A12204@daisy.vocalis.com> <20001118225558.A1180@debian.org> <20001118200111.A12372@x8b4e516e.dhcp.okstate.edu> <20001119225054.A14582@lina.inka.de> <8766ljfkwy.wl@surfchem0.riken.go.jp> <20001120011102.B31142@lovelife.olvc.ab.ca>

Hi,

At Mon, 20 Nov 2000 01:11:02 -0700,
Anthony Fok <foka@debian.org> wrote:

> To add to that list, China has the new GB18030-2000 standard
> (locale zh_CN.GB18030) which also contains many characters beyond Unicode.

Interesting.  I will have to mention it in my "Introduction to I18N"
document in the Debian Documentation Project (currently undergoing a
major rewrite).  Please see http://www.debian.org/doc/manuals/intro-i18n/

BTW, I think GB18030 would be a _character set_, not an _encoding_.
If so, we won't have a zh_CN.GB18030 locale, since the part after the
dot in a locale name is an encoding.

Examples (Japanese):
   JIS X 0201, JIS X 0208, JIS X 0212, and JIS X 0213 are _character sets_.
   EUC-JP, Shift-JIS, and ISO-2022-JP are _encodings_.
For simplified Chinese:
   GB 2312, GB 7589, GB 7590, GB 8565, GB 12052, and GBK are _character sets_.
   CN-GB (aka EUC-CN), GBK, and ISO-2022-CN are _encodings_.
For traditional Chinese:
   BIG5 and CNS 11643 are _character sets_.
   ISO-2022-CN, ISO-2022-CN-EXT, EUC-TW, and BIG5 are _encodings_.

Codes which are not ISO 2022-compliant (GBK and BIG5 in the lists
above, for example) tend not to separate the _character set_ from
the _encoding_.
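
To illustrate the distinction with the Japanese examples, here is a
minimal sketch, assuming Python 3 and its standard codecs `euc_jp`,
`shift_jis`, and `iso2022_jp`.  It takes one character from the
JIS X 0208 character set and encodes it three ways.  The variable
names are only for illustration.

```python
# One character from the JIS X 0208 character set (row 4, cell 2).
ch = "\u3042"  # HIRAGANA LETTER A

# Three encodings that carry the same character set as different bytes.
encoded = {name: ch.encode(name) for name in ("euc_jp", "shift_jis", "iso2022_jp")}

for name, data in encoded.items():
    print(f"{name:11s}", " ".join(f"{b:02x}" for b in data))

# Every byte sequence decodes back to the same abstract character.
decoded = {data.decode(name) for name, data in encoded.items()}
print(decoded == {ch})
```

The character stays the same in all three cases; only the byte
representation changes, and ISO-2022-JP additionally wraps the bytes
in escape sequences that switch character sets.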

> Very much so in Chinese.  In fact, the Chinese government has gone as far as
> to ban the sale of any Chinese software that only supports Unicode starting
> in 2001.  All new Chinese software must support the GB18030-2000 character
> set.  And yes, Microsoft will have to comply too; their current Unicode-only
> solution won't work.  (Ho ho ho!)  Apparently, the Chinese government is
> somewhat displeased to have the Chinese language controlled and *limited*
> by an International Consortium like Unicode.  There are *so* many Chinese
> characters that aren't in the 16-bit Unicode that it would create lots of
> trouble if Unicode were to become the de-facto standard in China. 
> GB18030-2000 is compatible with ISO-10646 AFAIK.

How severe!  Can a government have such a right?

However, this is also good news for Japanese users.  Software on
POSIX systems will use the locale mechanism and wide characters
instead of hard-coding Unicode and UTF-8, since this is the easiest
way to support both GB18030 and UTF-8.  UNIX vendors will also work
hard to support locale mechanisms.  Use of locales and wide characters
then leads naturally to support for encodings such as EUC-JP,
ISO-2022-JP, Shift-JIS, and so on.
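
The idea can be sketched roughly in Python 3, where `str` plays a role
similar to a wide-character string (an analogy only, not the C
`wchar_t` API).  The program receives the encoding name as a parameter,
as a locale would supply it, and works on decoded characters.  The
function name `count_chars` and the sample text are hypothetical.

```python
def count_chars(data: bytes, encoding: str) -> int:
    """Decode bytes in the given encoding and count characters, not bytes."""
    return len(data.decode(encoding))

text = "\u4e2d\u6587"  # two Chinese characters

# The same text as it would arrive under two different locale encodings.
samples = {enc: text.encode(enc) for enc in ("gb18030", "utf-8")}

for enc, data in samples.items():
    # The byte length differs per encoding; the character count does not.
    print(enc, len(data), count_chars(data, enc))
```

In a real program the encoding name might come from
`locale.getpreferredencoding(False)` after calling
`locale.setlocale(locale.LC_ALL, "")`, so the same code serves every
locale the system supports.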

This holds only _if GB18030 is not included in Unicode_.  However, I
think GB18030 will be included in Unicode in the future, if GB18030 is
a character set and not an encoding.

> Similar concerns are in Taiwan, and indeed many characters are only in
> CNS11643 (and ISO-10646) but not in Unicode.
> 
> Of course, these are mostly hearsay.  I don't know the details, as I was
> originally from Hong Kong, and I have been living in Canada for over 10
> years.  But speaking of Hong Kong, there are quite a few Chinese characters
> added by the HKSAR government that won't be in Unicode either.  So yeah,
> though I am bemused, I am kind of glad that the Chinese government take such
> a strong stance to force software support the new GB18030-2000 standard,
> which, like ISO-10646, has space for millions of characters.  :-)

ISO-10646 and Unicode share exactly the same character set and will
continue to do so in the future, though the width of the code space
differs (ISO-10646: 31 bits; Unicode: 0x000000 - 0x10ffff, a bit more
than 20 bits).

I suspect you are assuming that Unicode is 16-bit.  That is no longer
the case, though it is true that Unicode (1.0) _was_ 16-bit.
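
The sizes mentioned above can be checked with a short sketch, assuming
Python 3 (version 3.3 or later, where `sys.maxunicode` is always
0x10FFFF).  The variable names are only for illustration.

```python
import sys

UNICODE_MAX = 0x10FFFF
unicode_size = UNICODE_MAX + 1   # number of code points in the Unicode range
iso10646_size = 2 ** 31          # size of the 31-bit ISO-10646 code space
ucs2_size = 2 ** 16              # what a 16-bit-only Unicode could hold

print(unicode_size)                       # 1114112, i.e. 17 * 65536
print(2 ** 20 < unicode_size < 2 ** 21)   # "a bit more than 20 bits"
print(unicode_size > ucs2_size)           # more than 16 bits can address
print(iso10646_size > unicode_size)       # ISO-10646 code space is wider
print(sys.maxunicode == UNICODE_MAX)      # Python's str covers the full range
```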

---
Tomohiro KUBOTA <kubota@debian.org>
http://surfchem0.riken.go.jp/~kubota/
