some thoughts

To: debian-chinese@lists.debian.org
Subject: some thoughts
From: Thomas Chan <thomas@atlas.datexx.com>
Date: Wed, 30 Jun 1999 16:25:18 -0400 (EDT)
Message-id: <Pine.LNX.3.96.990630161858.11820A-100000@atlas.datexx.com>
Reply-to: Thomas Chan <tc31@cornell.edu>

Hi,

Here are some of my thoughts, comments, and notes on the state of affairs
of Chinese in Debian Linux. I hope this will engender some discussion and
comments.

Thanks,

Thomas Chan (Chen2 Kang1shi4)
tc31@cornell.edu

Some thoughts on Debian, Linux and Chinese, 1999.6.19:

Character Set Support
---------------------

1) GBK (aka GB 13000.1):
General information:
- 1993 standard from mainland China which extends GB2312 (1980)
- same ~20,000 character repertoire as Unicode, as they were
both developed in cooperation, according to the Unicode 2.0 book
- codepoint for codepoint backwards compatible with GB2312 (in EUC-CN encoding)
- one can write "Rong" in Premier Zhu Rongji's name with it :)
e.g. http://www.newchinapicture.com/company/newchinapicture.com.cn/shizheng.html
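
The backwards-compatibility point and the "Rong" example can be checked
with a short sketch. It assumes Python 3 and its built-in "gb2312" and "gbk"
codecs, which are used here purely as an illustration; the helper name
encodable() is made up for this example.

```python
# Sketch: GBK as a superset of GB2312 (in EUC-CN encoding).
# Assumes Python 3's built-in "gb2312" and "gbk" codecs.

def encodable(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Hanzi that exist in GB2312 should get the same bytes under GBK.
sample = "朱总理"
same_bytes = sample.encode("gb2312") == sample.encode("gbk")

# The "Rong" in Zhu Rongji's name.
rong = "镕"
rong_in_gb2312 = encodable(rong, "gb2312")
rong_in_gbk = encodable(rong, "gbk")

print(same_bytes, rong_in_gb2312, rong_in_gbk)
```

If the codecs behave as expected, the first and third values are true and the
second is false, which is the gap a GB2312-only system runs into.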

Microsoft and/or Windows-related:
- Microsoft upgraded the definition of their Codepage 936 from GB2312 to GBK
for the mainland China version of Win 95
- GBK is also the basis for Pan-Chinese NT Workstation, released in
Hong Kong, which is the mainland Chinese version of Win NT Workstation 4.0
but with an English interface (?), and supports simplified and traditional
Chinese because of that huge repertoire of characters
- GBK fonts "MS Hei" and "MS Song" can be gotten from the Internet
Explorer (ie31pkcn.exe) and Office 97 add-on packs (chssupp.exe)
- third-party CJK-enabling add-on products for Windows seem to have started
providing GBK support around 1997/1998 (?)

Debian and/or Linux-related:
- Debian only has GB2312 fonts, and the GB input methods are probably
only geared towards GB2312; existing software may be hardcoded to GB2312
coderanges
- a GBK font could probably be made by remapping the 24x24 Big5+ font in the
xfntbig5p-cmex24m package (stable/main/binary-i386/x11), but mainland
China (gov't) is picky about glyph design, so we should get a proper font
and input methods, preferably one that is approved with a certificate
like ??? (I found one on the www.dynalab.com.hk site before, but can't find it now)

Conclusions:
- we're behind the world in this area! (high priority)

2) Big5+:
General information:
- 1997 standard from Taiwan which extends Big5 (1984)
- expands Big5 to the same ~20,000 character repertoire as Unicode
- codepoint for codepoint backwards compatible with Big5
- CMEX http://www.cmex.org.tw/ has font, Cangjie/Zhuyin data, CNS 11643/GBK/Unicode
mapping tables

Microsoft and/or Windows-related:
- the definition of Codepage 950 is still Big5, but history has shown
that Microsoft expands their codepages with backwards-compatible
supersets. e.g. Codepage 936 went from GB2312 to GBK, and Codepage 949 from
KS C 5601 in EUC-KR to an expanded version (although for KS C 5601 in Johab
encoding, which is not compatible, they created a separate Codepage 1361).
Codepage 1252 (aka Winlatin1, ANSI) was likewise expanded to include the
Euro currency symbol. It wouldn't be too unusual if they redefined
Codepage 950 to Big5+, would it?
- Twinbridge 4.98+ is the only product that supports Big5+ so far, according
to Lunde's CJKV-IP book

Debian and/or Linux-related:
- Debian already has the xfntbig5p-cmex24m font, ultimately created from
the data at CMEX
- Debian doesn't have any input methods for Big5+, and software may be
hardcoded for Big5 ranges (I know at least cxterm is flawed here)
- Cangjie and Zhuyin (and even Pinyin) input methods could be created from
the data at CMEX
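
The hardcoded-range problem can be illustrated with a small sketch. The byte
ranges are assumptions taken from the usual descriptions of the two encodings
(classic Big5 lead bytes 0xA1-0xF9; Big5+ lead bytes 0x81-0xFE with a wider
trail-byte range) and should be checked against the CMEX tables. The function
names are hypothetical, and this is not cxterm's actual code.

```python
# Sketch: why code hardcoded to Big5 ranges rejects Big5+ characters.
# The ranges below are assumptions, not taken from the CMEX specification.

def is_big5_pair(lead, trail):
    """Classic Big5: lead 0xA1-0xF9, trail 0x40-0x7E or 0xA1-0xFE."""
    return 0xA1 <= lead <= 0xF9 and (
        0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE)

def is_big5plus_pair(lead, trail):
    """Big5+ (assumed): lead 0x81-0xFE, trail 0x40-0x7E or 0x80-0xFE."""
    return 0x81 <= lead <= 0xFE and (
        0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE)

def split_chars(data, is_pair):
    """Split bytes into 1- and 2-byte units; ValueError on a rejected pair."""
    units = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Plain ASCII byte.
            units.append(data[i:i + 1])
            i += 1
        elif i + 1 < len(data) and is_pair(data[i], data[i + 1]):
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(
                "byte 0x%02X at offset %d is outside the expected range"
                % (data[i], i))
    return units

classic = b"\xa4\xa4\xa4\xe5"      # two characters inside classic Big5
extended = b"\x81\x40" + classic   # plus one pair only valid in the wider range

print(len(split_chars(classic, is_big5_pair)))
print(len(split_chars(extended, is_big5plus_pair)))
```

Passing extended to split_chars() with is_big5_pair raises ValueError, which
is the kind of failure a program hardcoded to Big5 would show on Big5+ text.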

Conclusions:
- it'd be cool to be among the first OS's to support this fully

3) Big5 w/ GCCS
General information:
- GCCS (Government Common Character Set) is a 1995 standard from HKSAR
- HK was buried in a mess of proprietary extensions to Big5, so GCCS was born
- uses the user-defined regions of Big5 to add 3,049 characters for
local use (names of places in HK), Cantonese dialectal characters,
Japanese characters, and PRC simplified characters
- ~1,500 of those 3,049 characters are not in Unicode
- GCCS support required for products supplied to the government
- Big5 w/ GCCS combination conflicts with codepoints for Big5+ :(
- there is also an extension to GCCS by the HK Department of Judiciary (?)
- ITSD http://www.info.gov.hk/gccs/ has fonts, Cangjie and Quick Cangjie data
- DynaLab HK has info at http://www.dynalab.com.hk/font/gaigi.htm and sometimes
interesting information in the News section at http://www.dynalab.com.hk/whatsnew.htm

Microsoft and/or Windows-related:
- get the stuff from ITSD
- Microsoft provides fonts for Pan-Chinese NT Workstation at
http://microsoft.com/hk/pcntw/html/extras.htm

Debian and/or Linux-related:
- no support

Conclusions:
- we need this if we want Debian to do well in Hong Kong

4) Big5 w/ various proprietary extensions
- We might want to provide fonts and input methods, such as for DynaLab HK A,
Apple Daily online newspaper, HKUST, etc for legacy use. But we should
push those stragglers to GCCS. We might also want to provide fonts and input
methods for non-HK extensions like KuoChiao's, ETen's, etc.

5) GBK w/ GCCS
General information:
- just like 1) GBK, except the 1,500 characters in GCCS that are not in
Unicode (and thus, GBK) are stuffed in there.

Microsoft and/or Windows-related:
- Microsoft provides fonts for Pan-Chinese NT Workstation at
http://microsoft.com/hk/pcntw/html/extras.htm

Debian and/or Linux-related:
- no support

Conclusions:
- Does it matter? I think Hong Kong is predominantly a Big5 world.

6) Big5 w/ Big5e
General information:
- Big5e (Big5 extension) is very new, 1999 (?)
- extends Big5 with 3954 characters, all from CNS 11643 planes 3 and 4
- not as extreme as Big5+, and it looks incompatible with Big5+
- CMEX http://www.cmex.org.tw/ provides fonts, Cangjie/Zhuyin data,
mapping tables

Microsoft and/or Windows-related:
- get the stuff from CMEX

Debian and/or Linux-related:
- no support

Conclusions:
- no opinion, except that it looks like Big5+ is really a better choice in
the long run

7) CNS 11643
General information:
- ~50,000 characters
- first two planes are almost synonymous with Big5
- virtually dead because of Big5?
- www.ifcss.org has a font from CBS
http://www.ifcss.org/ftp-pub/software/fonts/cns/

Microsoft and/or Windows-related:
- no support

Debian and/or Linux-related:
- planes 1 and 2 supported by cjk-latex and related packages
- xemacs (mule) supports all seven planes, but input methods are lacking
- no bundled fonts for all seven planes, although I suspect there
is a 40x40 one in the figfonts-cjk package (for figlet, though); intlfonts-chinese
omits them (I believe the same font) because of license uncertainty

Conclusions:
- it would be nice to have all those characters in X but may not be possible
for technical reasons

8) CCCII
General information:
- ~70,000 characters
- it or its ANSI subset cousin is used for bibliographic purposes
- www.ifcss.org has two fonts, one BDF, and one in a proprietary (?) format;
the former is a 64x64; the latter is created by a company called JOIN
http://www.ifcss.org/ftp-pub/software/fonts/misc/bdf/

Microsoft-related:
- some third-party library CJK catalog terminals (RLIN?)

Debian and/or Linux-related:
- no support

Conclusions:
- if anyone wants to use Linux to make a CJK library terminal...

Input Methods (IME/FEP)
-----------------------

IMEs should be separated from programs and shared:

- Different programs have different IME's, of differing levels of
quality. xcin has Zhuyin and Cangjie IME's that are in active
development; cxterm has a lot of IME's but they are old and stale.
yudit has only Cangjie and mule (in xemacs) has only ???. Since yudit
and mule are not put together by Chinese users (yudit by a Hungarian,
mule by Japanese), their Chinese support is terrible.

- Multiple versions of IME's (the current situation) means they are not
guaranteed to function the same way. e.g., the Cangjie IME that comes
with one program may have "commonly mistaken codes", while one that
comes with another program might not.

- Data can be ported to more than one IME. e.g., the same Mandarin
pronunciation data can be used to generate Pinyin and Zhuyin IME's.
(with small exceptions for mainland/Taiwan differences)
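
As a rough sketch of the last point, one pronunciation list can feed both a
Pinyin and a Zhuyin lookup table. Everything below is hypothetical sample
data (four characters, a hand-written syllable mapping, tones 1-4 only); it
is not taken from xcin, cxterm, or any real IME, and it ignores the
mainland/Taiwan differences.

```python
# Sketch: one pronunciation list feeding two IME lookup tables.
# READINGS and PINYIN_TO_ZHUYIN are tiny hypothetical samples.
from collections import defaultdict

# character -> Pinyin syllable followed by a tone number
READINGS = {
    "中": "zhong1",
    "文": "wen2",
    "馬": "ma3",
    "字": "zi4",
}

# Hand-written syllable mapping covering only the sample above.
PINYIN_TO_ZHUYIN = {
    "zhong": "ㄓㄨㄥ",
    "wen": "ㄨㄣ",
    "ma": "ㄇㄚ",
    "zi": "ㄗ",
}

# Zhuyin tone marks; the first tone is conventionally left unmarked.
TONE_MARKS = {"1": "", "2": "ˊ", "3": "ˇ", "4": "ˋ"}

def to_zhuyin(pinyin):
    """Convert e.g. 'ma3' to its Zhuyin spelling plus tone mark."""
    syllable, tone = pinyin[:-1], pinyin[-1]
    return PINYIN_TO_ZHUYIN[syllable] + TONE_MARKS[tone]

def build_table(readings, key_func):
    """Map each input key to the list of characters it can produce."""
    table = defaultdict(list)
    for char, pinyin in readings.items():
        table[key_func(pinyin)].append(char)
    return dict(table)

pinyin_table = build_table(READINGS, lambda p: p)
zhuyin_table = build_table(READINGS, to_zhuyin)

print(pinyin_table)
print(zhuyin_table)
```

Both tables come from the same READINGS data, so a correction to one reading
would show up in both input methods.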

Suggested packaging arrangement:

1) ime-zh-cn
Contains IME's that a mainland China user would expect to have.
e.g. Wubi, Wubizixing, Pinyin

2) ime-zh-tw
Contains IME's that a Taiwan user would expect to have.
e.g. Zhuyin, Cangjie

3) ime-fangyan
Contains IME's for dialects and foreign languages.
e.g.
Cantonese
de facto "Cantonese Pinyin"
Sidney Lau - used in the 70's language primers
Yale - used by linguists and foreign language textbooks
Jyutping - promoted by LSHK
Hakka
Dr. Liu Zinfad's Hagfa
Southern Min
???
Japanese
on yomi

4) ime-other
Contains less commonly used IME's. e.g., Wade-Giles for inputting
Mandarin (some non-Chinese might still use this), dictionary
indices ("Kangxi page 545, third character..."), 4 corner, etc.

For the naming, I avoided "gb" and "big5" because these may change
in the future (e.g., unicode). I also avoided "jianti" and "fanti"
because GBK, Big5+, and Unicode can do both jianti and fanti.
(GB = jianti and Big5 = fanti are both no longer true.) I also avoided
listing "chinese" anywhere, because one can use a character set/encoding
for other languages. e.g., a hypothetical ime-ja package with a Pinyin IME
for EUC-JP, for people in Japan (this already exists in some third-party
products for Windows).

Some comments on programs, ideas about things to package/fix/create
-------------------------------------------------------------------

cedict{b5,gb} - Any way to download updates/additions to the dictionary
without downloading it all over again in entirety? i.e., patches.
Currently it is small (409K), but if it grows in the future like its
inspiration, the Japanese->English EDICT dictionary (I believe packaged
for Debian-JP), it could become huge (EDICT is several megs at this
point).
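
One way to picture the patch idea is a line-based diff between two releases
of the dictionary. The sketch below uses Python's standard difflib module;
the entries are made-up lines in a CEDICT-like layout and the file names are
placeholders, not real package data.

```python
# Sketch: shipping a dictionary update as a patch instead of the whole file.
# The entries and file names are hypothetical.
import difflib

old_version = [
    "中文 [zhong1 wen2] /Chinese language/\n",
    "字典 [zi4 dian3] /dictionary/\n",
]
new_version = [
    "中文 [zhong1 wen2] /Chinese language/\n",
    "字典 [zi4 dian3] /character dictionary/\n",
    "拼音 [pin1 yin1] /Pinyin romanization/\n",
]

# A unified diff contains only the changed lines plus a little context.
patch = list(difflib.unified_diff(old_version, new_version,
                                  fromfile="cedict.old",
                                  tofile="cedict.new"))

# Separate added/removed entries from the "+++"/"---" file header lines.
added = [line for line in patch
         if line.startswith("+") and not line.startswith("+++")]
removed = [line for line in patch
           if line.startswith("-") and not line.startswith("---")]

print("".join(patch))
```

For a dictionary where only a few entries change between releases, such a
patch would stay small even if the full file grows to several megabytes.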

tcs - tcs is a character set/encoding converter from Plan9. The Big5
support is for the erroneous "HKU standard"; this should be fixed.

xmbdfed - xmbdfed is a font editor by Mark Leisher. It can handle HBF
fonts, but the Debian package does not include it because of licensing
problems with the HBF code (written by someone other than Leisher).

yudit - yudit is a Unicode-based multi-language editor. Upstream author
needs help--he doesn't speak all the languages it supports.

xemacs (mule) - Chinese support could be better too.

Liu Zinfad's Hagfa input method - A Hakka input method. Any Hakka
out there who'd use this? Dylan Sung (in HK until July) has
information on it at http://ubik.virtual-pc.com/sapienti/hakintro.htm ,
but Prof. Liu himself would be a better source for upstream updates.

dates - Support for ROC year?
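
The arithmetic behind ROC years is simple: ROC year 1 is 1912, so the ROC
year is the Gregorian year minus 1911. A minimal sketch follows; the function
names and the output layout are hypothetical, not an existing tool's format.

```python
# Sketch: Gregorian year -> ROC year, where ROC year 1 is 1912.
import datetime

ROC_OFFSET = 1911

def to_roc_year(gregorian_year):
    """Return the ROC year for a Gregorian year of 1912 or later."""
    if gregorian_year <= ROC_OFFSET:
        raise ValueError("the ROC era starts in 1912")
    return gregorian_year - ROC_OFFSET

def format_roc_date(date):
    """Format a date as 'ROC <year>.<month>.<day>' (a made-up layout)."""
    return "ROC %d.%d.%d" % (to_roc_year(date.year), date.month, date.day)

print(format_roc_date(datetime.date(1999, 6, 30)))
```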

chinalanguage.com - a nice web-based zidian by Thomas Chin. Chin
provides his data for download; perhaps some of it might be appropriate
to package.

dynadoc - DynaDoc is a PDF-like format by DynaLab. It's used by HKSAR
and perhaps also mainland China and Taiwan to publish government
and industry publications. Supposedly it has built-in CJK support, including
GCCS. A reader for this on Linux would be nice.
http://www.dynalab.com.hk/internet/index.htm

docs - Manuals, tutorials, etc not in Chinese, for Chinese language
students would be nice. e.g., how to type in cangjie, pinyin, etc.

mtv - mtv is a VCD player. Unofficial .deb's are available from
http://www.mpegtv.com/ . Uses the XForms library. I believe it would
have to go in "non-free". Given the popularity of VCD's in Asia, it would be
nice if Debian could include one (there are questions on newsgroups about
where to get a VCD player for Linux).

tools - It'd be nice to have tools to help users create their own input
methods, and convert amongst xcin/cxterm/twinbridge/windows/etc formats.
Support for user-defined extensions (EUDC, "end-user defined fonts", "gaiji")
would be nice too. Also conversion utilities, both strict (per codepoint),
and between traditional<->simplified cultural standards (Office 2000 does the
latter, according to the ads), would be nice.
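
The two kinds of conversion can be contrasted in a short sketch. It assumes
Python 3's built-in "big5", "gbk" and "gb2312" codecs; TRAD_TO_SIMP is a tiny
hypothetical table, and a real converter would need far more data and would
have to handle one-to-many cases that a per-character lookup cannot.

```python
# Sketch: strict (per codepoint) vs. cultural (traditional -> simplified)
# conversion. TRAD_TO_SIMP is a two-entry hypothetical sample.

TRAD_TO_SIMP = {"國": "国", "語": "语"}

def transcode(data, source, target):
    """Strict conversion: same characters, different encoding."""
    return data.decode(source).encode(target)

def simplify(text):
    """Cultural conversion: replace traditional forms with simplified ones."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

big5_bytes = "國語".encode("big5")

# Strict conversion keeps the traditional forms, which GBK can represent.
gbk_bytes = transcode(big5_bytes, "big5", "gbk")

# GB2312 lacks these traditional forms, so they are simplified first.
gb2312_bytes = simplify(big5_bytes.decode("big5")).encode("gb2312")

print(gbk_bytes.decode("gbk"), gb2312_bytes.decode("gb2312"))
```

Calling transcode(big5_bytes, "big5", "gb2312") directly should fail with
UnicodeEncodeError, which is why the two kinds of tool are worth keeping
separate.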

Hardcopy References
-------------------

Lunde, Ken. _CJKV Information Processing_. O'Reilly. Dec 1998.
ISBN 1-56592-224-7. (or his older edition, _Understanding Japanese
Information Processing_ from 1993)

Meyer, Dirk. "Dealing With Hong Kong Specific Characters".
Multilingual, vol. 9, issue 3 (April 1998), pp. 35-38.

Kano, Nadine. _Developing International Software for Windows 95 and
Windows NT_. (I think this Win 95/Win NT 3.51 book is now out of print.)

Unicode Consortium. _The Unicode Standard, Version 2.0_. Addison-Wesley.
1996. ISBN 0-201-48345-9.
