
Re: GB18030 support in Mozilla (fwd)



Hi all,

This may be of interest to someone.  See Whistler's comments below on
GB18030 mapping table problems.


Thomas Chan
tc31@cornell.edu


---------- Forwarded message ----------
Date: Sun, 12 Nov 2000 16:07:49 -0800
From: Katsuhiko Momoi <momoi@netscape.com>
To: mozilla-i18n@mozilla.org
Cc: webchina@dsl-only.net
Subject: Re: GB18030 support in Mozilla
Resent-Date: Sun, 12 Nov 2000 16:09:08 -0800 (PST)
Resent-From: mozilla-i18n@mozilla.org

Yueheng,

We need to resolve some issues concerning GB18030 first. Questions have
been raised by knowledgeable people about the details of this standard.
Please consult the following 2 messages for more information. Frank Tang
is on vacation now and we want him to participate in this discussion
also. The link to the GB18030 info file in English (in PDF format)
appears in the first message:

- Kat

=====================
Message 1:

-------- Original Message --------
Subject: GB18030 summary and issues
Date: Fri, 13 Oct 2000 09:57:00 -0800 (GMT-0800)
From: Markus Scherer <markus.scherer@jtcsv.com>
To: "Unicode List" <unicode@unicode.org>

Dear Uni-encoders and -decoders,

Dirk Meyer from Adobe has put together an extensive summary of the
Chinese GB 18030 encoding standard that was published on 2000-mar-17.
Ken Lunde and I assisted Dirk with reviews and comments.

The summary is on the web site of Ken's famous CJKV book "with the
fish":

ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

To summarize the summary, we now have an English text describing the new
encoding in detail. There are a few apparent errors, typos, and
inconsistencies in the Chinese standard text that need to be resolved.

For implementers, there is enough information in the summary to describe
the encoding structure and to prepare an implementation.

What is still missing - aside from the resolution of the issues
mentioned here - is a precise mapping table between at least the
one-byte and two-byte portions of GB 18030 and Unicode, in both
directions.
In theory, it should be almost the same as GBK, but to be sure, we need
precise, complete, and machine-readable mappings.
Given the one-byte and two-byte portions and the description in the
standard and in the summary, the four-byte portion can be derived with a
little bit of Perl or similar.
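
A minimal sketch of that derivation, in Python rather than Perl. It
assumes the four-byte structure given in the summary (bytes in the ranges
0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39) and assumes that each BMP
code point not already covered by the one- and two-byte portions receives
the next four-byte code in order, starting at 0x81308130. The function
names and the `covered` input are hypothetical; a real run needs the
precise one- and two-byte tables, and the ordering should be checked
against the standard text.

```python
def four_byte_from_index(index):
    """Turn a linear index (0, 1, 2, ...) into four GB 18030 bytes."""
    index, b4 = divmod(index, 10)    # 4th byte: 0x30..0x39 (10 values)
    index, b3 = divmod(index, 126)   # 3rd byte: 0x81..0xFE (126 values)
    b1, b2 = divmod(index, 10)       # 2nd byte: 0x30..0x39 (10 values)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def derive_four_byte_bmp(covered):
    """Assign four-byte codes to BMP code points missing from `covered`.

    `covered` is the set of Unicode code points already reachable through
    the one- and two-byte portions (assumed input, not supplied here).
    """
    table = {}
    index = 0
    for cp in range(0x10000):
        if cp in covered:
            continue
        table[cp] = four_byte_from_index(index)
        index += 1
    return table


# Placeholder input: pretend only ASCII is covered by shorter codes.
demo = derive_four_byte_bmp(set(range(0x80)))
print(demo[0x80].hex())
```

With the real one- and two-byte tables in place of the placeholder set,
the same loop would yield a candidate four-byte table to compare against
the printed standard.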

Anyone who needs to implement or know about GB 18030 should probably
read this text.

Anyone who can contribute precise mapping tables or help resolve the
open issues, please do so.


Best regards,

markus

=======================================
Message 2:

-------- Original Message --------
Subject: [li18nux:753] Fwd: RE: GB18030 summary and issues
Resent-Date: Thu, 19 Oct 2000 20:01:09 -0700 (PDT)
Resent-From: linux-i18n@netscape.com
Date: Fri, 20 Oct 2000 11:56:55 +0900
From: "Martin J. Duerst" <duerst@w3.org>
Reply-To: li18nux@li18nux.org
To: li18nux@li18nux.org

With the permission of the author, I'm sending you a comment on
the GB18030 mapping table that appeared on this list
some time ago.

Regards,   Martin.


>X-UML-Sequence: 5977 (2000-10-17 00:36:44 GMT)
>From: Kenneth Whistler <kenw@sybase.com>

>Date: Mon, 16 Oct 2000 16:36:41 -0800 (GMT-0800)
>Subject: RE: GB18030 summary and issues

>I've taken a look at the GB18030.TXT you provided, and unfortunately,
>as it stands, the mapping table has *major* problems.
>
>Most of these problems really derive from the serious flaws in GB 18030-2000
>itself, so I'm not sure exactly what implementers are going to
>do about them, but so you can focus in on the issues, here is some
>of what I turned up.
>
>A. GB 18030's encoding and mapping of Annex B (p. 91) -- ideographic
>variation indicator, and the ideographic description characters, is
>flat-out wrong. The same thing applies to Annex C (p. 92), the CJK
>radicals supplement. Essentially, the relevant Chinese committee rushed this
>thing to publication without having determined where these characters
>were encoded in 10646, *despite* the fact that GB 18030 then makes
>normative mappings to the entirety of 10646-1:2000 (actually to
>GB 13000.1, but that is just a pointer to 10646-1:2000, unless they
>printed *that* wrong, too, in which case we are even more screwed up).
>The result is just out-and-out errors. To wit:
>
>   1. U+303E (GB18030 A989) is mapped to U+E7E7 (user-defined)
>
>The net result in GB18030.TXT is that GB A989 is mapped into private use,
>even though in the chart it is shown as U+303E. But U+303E, as a *code
>position*, is mapped to the 4-byte form 0x8139A634.
>
>   2. U+2FF0..U+2FFB (GB18030 A98A..A995) are mapped correctly in the
>      main tables of GB18030 (p. 82), but are mapped again incorrectly
>      in Annex B (U+E7E8..U+E7F3, user-defined).
>
>The net result in GB18030.TXT is that all the ideographic description
>characters are double-mapped.
>
>   3. U+2E80..U+2EF3, the CJK radicals supplement, are mapped haphazardly,
>      apparently from an earlier draft: GB18030 FE50..FEA0 is mapped
>      to U+E815..U+E864, instead of the actual Unicode code points. In
>      addition, some of the characters in Annex C are actually in
>      Vertical Extension A, resulting in gapping in the tables.
>
>The net result in GB18030.TXT is that all the CJK radicals and
>other characters in Annex C are double-mapped.
>
>B. GB 18030 makes the mistake of trying to encode all code positions
>in GB 13000.1 (= 10646-1:2000), regardless of their status. That
>means, among other things, that all private use code positions
>in Unicode on the BMP are given GB 18030 code assignments --
>*regardless* of their status in GB 18030 as assigned characters or
>not. This makes a complete hash, compounded by the fact that all the
>characters mentioned in A above are erroneously assigned to private
>use codes in Unicode. That renders the mapping of the rest of user
>space trash.
>
>C. As an extension of B., GB 18030 also maps surrogate code positions
>to GB 18030 4-byte codes, *as if* they were characters. Thus U+D800
>(a surrogate code point, not an unassigned character) is mapped to
>0x8336C739, indifferently from U+D7FF (an unassigned character
>position) being mapped to 0x8336C738.
>
>Incidentally, there appears to be an off-by-one error in this area in
>GB18030.TXT as well: GB18030.TXT shows 0x8336C830 = U+D800, whereas
>the printed text of the GB18030-2000 standard itself shows
>0x8336C739 = U+D800.
>
>I'm not sure what the solution here is, other than to encourage China
>to fix its $@&#*^! standard. But if the tables you posted have in
>fact already been rolled out in Linux implementations in China, then
>we are all going to have to live with horrendous interoperability
>problems resulting from bad mapping tables for bad standards.
>
>Here it is the year 2000, and having lived with the yen/backslash
>problem and the fullwidth tilde problem, and the not sign problem for
>decades in East Asian implementations, I guess everybody has decided
>that we should start off the new century with a brand-spanking new
>set of ways to shoot ourselves in both feet at the same time for
>Chinese implementations.
>
>--Ken
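
The off-by-one discrepancy described under C can be checked by converting
each four-byte code to a linear index. This is a small sketch, assuming
the usual four-byte layout (first and third bytes 0x81..0xFE, second and
fourth bytes 0x30..0x39); the helper name is hypothetical.

```python
def four_byte_index(code):
    """Linear position of a four-byte GB 18030 code, counting from 0x81308130."""
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    ok = (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
          and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)
    if not ok:
        raise ValueError(f"not a four-byte GB 18030 code: {code:#010x}")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


printed_d800 = four_byte_index(0x8336C739)  # U+D800 per the printed standard
table_d800 = four_byte_index(0x8336C830)    # U+D800 per GB18030.TXT
printed_d7ff = four_byte_index(0x8336C738)  # U+D7FF per the printed standard

print(table_d800 - printed_d800)    # distance between the two U+D800 claims
print(printed_d800 - printed_d7ff)  # distance between U+D7FF and U+D800
```

Because the fourth byte wraps from 0x39 back to 0x30 while the third byte
increments, 0x8336C830 is the code immediately after 0x8336C739, which is
consistent with the table being shifted by one position in this area.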

========================= End of 2 messages quoted ====================


Yueheng Xu wrote:
> 
> The mandatory Chinese national standard GB18030 is coming by the
> end of 2000. Do we (the Mozilla community) have any plans to add
> support for it in our browser?
>
> Currently the largest Chinese character set we support in Mozilla is
> GBK, which has a little over 20,000 characters.
> 
> The new GB18030 is a superset of it and has about 27,000 characters. It
> contains one-byte, two-byte and four-byte characters.
> 
> GB18030 is a mandatory standard that takes effect by the end of 2000.
> After that date, no information system that does not support GB18030
> will be allowed to be marketed in China.
> 
> With GB18030, all the Simplified Chinese and Traditional Chinese
> characters available in GB2312, GBK, BIG5, etc. are included as a
> subset, and GB18030 is backward compatible with GB2312 (and possibly
> also GBK?). I don't have a character set table with me.
> 
> If anyone can send me a GB18030 table, I can find time to add support
> for it in Mozilla.
> 
> Yueheng Xu,
> CEO
> Network 2000, Inc.
> http://www.n2k.net
> email: webchina@dsl-only.net

-- 
Katsuhiko Momoi
Netscape International Client Products Group
momoi@netscape.com

What is expressed here is my personal opinion and does not reflect 
official Netscape views.



