
Re: UTF-8 locales


At Mon, 20 Nov 2000 01:11:02 -0700,
Anthony Fok <foka@debian.org> wrote:

> To add to that list, China has the new GB18030-2000 standard
> (locale zh_CN.GB18030) which also contains many characters beyond Unicode.

Interesting.  I will have to mention it in my "Introduction to I18N"
document for the Debian Documentation Project (now undergoing a major
rewrite).  Please see http://www.debian.org/doc/manuals/intro-i18n/

BTW, I think GB18030 would be a _character set_, not an _encoding_.
If so, we won't have a zh_CN.GB18030 locale.

Examples (Japanese):
   JIS X 0201, JIS X 0208, JIS X 0212, JIS X 0213 are _character sets_.
   EUC-JP, Shift-JIS, ISO-2022-JP are _encodings_.
For simplified Chinese:
   GB 2312, GB 7589, GB 7590, GB 8565, GB 12052, GBK are _character sets_.
   CN-GB (aka EUC-CN), GBK, ISO-2022-CN are _encodings_.
For traditional Chinese:
   BIG5, CNS 11643 are _character sets_.
   ISO-2022-CN, ISO-2022-CN-EXT, EUC-TW, BIG5 are _encodings_.

Codes which are not ISO-2022-compliant tend not to separate
_character set_ and _encoding_, which is why GBK and BIG5 appear
in both lists.
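
To make the distinction concrete, here is a minimal sketch, assuming
Python 3 and its standard codecs (my choice for illustration, not
something the standards prescribe).  It takes one character from the
JIS X 0208 character set and shows the different byte sequences it gets
under each of the three Japanese encodings.  The helper name encode_all
is hypothetical.

```python
# One character from the JIS X 0208 character set: HIRAGANA LETTER A.
ch = "\u3042"

def encode_all(text, codecs=("euc_jp", "shift_jis", "iso2022_jp")):
    """Encode the same text under several encodings (hypothetical helper)."""
    return {name: text.encode(name) for name in codecs}

encoded = encode_all(ch)
for name, raw in encoded.items():
    # Same character, different bytes: the character set is shared,
    # only the encoding differs.
    print(f"{name:12} {raw.hex()}")

# Decoding each byte sequence with its own encoding gives back the
# same character.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}
```

EUC-JP and Shift-JIS each use two bytes here, while ISO-2022-JP wraps
the two bytes in escape sequences that switch the character set in and
out.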

> Very much so in Chinese.  In fact, the Chinese government has gone as far as
> to ban the sale of any Chinese software that only supports Unicode starting
> in 2001.  All new Chinese software must support the GB18030-2000 character
> set.  And yes, Microsoft will have to comply too; their current Unicode-only
> solution won't work.  (Ho ho ho!)  Apparently, the Chinese government is
> somewhat displeased to have the Chinese language controlled and *limited*
> by an International Consortium like Unicode.  There are *so* many Chinese
> characters that aren't in the 16-bit Unicode that it would create lots of
> trouble if Unicode were to become the de-facto standard in China. 
> GB18030-2000 is compatible with ISO-10646 AFAIK.

How severe!  Can a government have such a right?

However, this sounds good for Japanese people too.  Software on
POSIX systems will use locales and wide characters instead of assuming
Unicode and UTF-8, since this is the easiest way to support both GB18030
and UTF-8.  And UNIX vendors will work hard to support locale mechanisms.
Then the use of locales and wide characters leads naturally to support
for encodings such as EUC-JP, ISO-2022-JP, Shift-JIS, and so on.
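
A rough sketch of that idea, assuming Python 3: the program logic works
only on a decoded internal form, and the external encoding is a
parameter applied at the boundary.  Python's str stands in for wide
characters here; that is only an analogy, not the POSIX locale API, and
the function names to_internal and count_characters are hypothetical.

```python
def to_internal(raw: bytes, encoding: str) -> str:
    """Convert external bytes to the internal form (analogous to
    converting multibyte text to wide characters)."""
    return raw.decode(encoding)

def count_characters(raw: bytes, encoding: str) -> int:
    """Logic written once against the internal form; it never looks
    at the byte layout of any particular encoding."""
    return len(to_internal(raw, encoding))

sample = "\u65e5\u672c\u8a9e"  # three CJK characters ("Japanese language")

counts = {}
for enc in ("euc_jp", "shift_jis", "iso2022_jp", "gb18030", "utf_8"):
    raw = sample.encode(enc)          # what a file or terminal would hold
    counts[enc] = count_characters(raw, enc)
    print(f"{enc:12} {len(raw):2d} bytes -> {counts[enc]} characters")
```

The byte counts differ from encoding to encoding, but the character
count does not, and supporting one more encoding needs no change to
count_characters.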

I will be right only _if GB18030 is not included in Unicode_.  However, I
think GB18030 will be included in Unicode in the future, if GB18030 is a
character set, not an encoding.

> Similar concerns are in Taiwan, and indeed many characters are only in
> CNS11643 (and ISO-10646) but not in Unicode.
> Of course, these are mostly hearsay.  I don't know the details, as I was
> originally from Hong Kong, and I have been living in Canada for over 10
> years.  But speaking of Hong Kong, there are quite a few Chinese characters
> added by the HKSAR government that won't be in Unicode either.  So yeah,
> though I am bemused, I am kind of glad that the Chinese government takes such
> a strong stance to force software to support the new GB18030-2000 standard,
> which, like ISO-10646, has space for millions of characters.  :-)

ISO-10646 and Unicode share exactly the same character set and will
continue to do so in the future, though the size of the code space
differs (ISO-10646: 31 bits; Unicode: 0x000000 - 0x10ffff, a bit more
than 20 bits).

I suppose you are assuming that Unicode is 16-bit.  It is true
that Unicode (1.0) _was_ 16-bit, but it no longer is.
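
The arithmetic can be checked with a short sketch, assuming Python 3.
The constants follow the figures given above; the function utf16_units
is a hypothetical helper that applies the standard surrogate-pair
calculation, and 0x20000 is simply an example code point outside the
first 16-bit plane.

```python
UNICODE_MAX = 0x10FFFF        # top of the Unicode code space
ISO10646_MAX = 2**31 - 1      # top of a 31-bit code space

# 0x10FFFF needs 21 bits, i.e. a bit more than 20.
bits_needed = UNICODE_MAX.bit_length()

# The Unicode code space is 17 planes of 65536 code points each.
planes = (UNICODE_MAX + 1) // 0x10000

def utf16_units(cp: int) -> list[int]:
    """Return the 16-bit code units representing code point cp
    (hypothetical helper; does not reject surrogate code points)."""
    if cp <= 0xFFFF:
        return [cp]
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

cp = 0x20000                  # example code point beyond 16 bits
units = utf16_units(cp)
print(bits_needed, planes, [hex(u) for u in units])
```

A code point above 0xFFFF takes two 16-bit units rather than one, which
is how a 16-bit representation reaches the rest of the code space.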

Tomohiro KUBOTA <kubota@debian.org>
