
some thoughts


Here are some of my thoughts, comments, and notes on the state of affairs
of Chinese in Debian Linux.  I hope this will engender some discussion and


Thomas Chan (Chen2 Kang1shi4)

Some thoughts on Debian, Linux and Chinese, 1999.6.19:

Character Set Support

1) GBK (aka GB 13000.1):
     General information:
     - 1993 standard from mainland China which extends GB2312 (1980)
     - same ~20,000 character repertoire as Unicode, as they were
       both developed in cooperation, according to the Unicode 2.0 book
     - codepoint for codepoint backwards compatible with GB2312 (in EUC-CN encoding)
     - one can write "Rong" in Premier Zhu Rongji's name with it :)
       e.g. http://www.newchinapicture.com/company/newchinapicture.com.cn/shizheng.html
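A quick way to see the difference between the two repertoires is the sketch below. It assumes Python and its bundled "gb2312" and "gbk" codecs, which are not part of any of the software discussed here; the helper name encodable is made up for illustration.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

name = "朱镕基"  # Zhu Rongji; the middle character is the problematic "Rong"

for enc in ("gb2312", "gbk"):
    print(enc, encodable(name, enc))

# Characters that were already in GB2312 keep the same bytes under GBK,
# which is what "codepoint for codepoint backwards compatible" means.
assert "朱基".encode("gb2312") == "朱基".encode("gbk")
```

The same check can be run over a whole file to see whether GB2312-only software would choke on it.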

     Microsoft and/or Windows-related:
     - Microsoft upgraded the definition of their Codepage 936 from GB2312 to GBK
       for the mainland China version of Win 95
     - GBK is also the basis for Pan-Chinese NT Workstation, released in
       Hong Kong, which is the mainland Chinese version of Win NT Workstation 4.0
       but with an English interface (?), and supports simplified and traditional
       Chinese because of that huge repertoire of characters
     - GBK fonts "MS Hei" and "MS Song" can be gotten from the Internet
       Explorer (ie31pkcn.exe) and Office 97 add-on packs (chssupp.exe)
     - third-party CJK-enabling add-on products for Windows seem to have started
       providing GBK support around 1997/1998 (?)

     Debian and/or Linux-related:
     - Debian only has GB2312 fonts, and the GB input methods are probably
       only geared towards GB2312; existing software may be hardcoded to GB2312
     - a GBK font could probably be made by remapping the 24x24 Big5+ font in the
       xfntbig5p-cmex24m package (stable/main/binary-i386/x11), but mainland
       China (gov't) is picky about glyph design, so we should get a proper font
       and input methods, preferably one that is approved with a certificate
       like ??? (I found one on the www.dynalab.com.hk site before, but can't find it now)

     - we're behind the world in this area! (high priority)

2) Big5+:
     General information:
     - 1997 standard from Taiwan which extends Big5 (1984)
     - expands Big5 to the same ~20,000 character repertoire as Unicode
     - codepoint for codepoint backwards compatible with Big5
     - CMEX http://www.cmex.org.tw/ has font, Cangjie/Zhuyin data, CNS 11643/GBK/Unicode
       mapping tables

     Microsoft and/or Windows-related:
     - the definition of Codepage 950 is still Big5, but history has shown
       that Microsoft expands its codepages into backwards-compatible
       supersets.  e.g. Codepage 936 went from GB2312 -> GBK, and Codepage 949
       from KS C 5601 in EUC-KR to an expanded version (although they created a
       separate Codepage 1361 for KS C 5601 in Johab encoding, which is not
       compatible with it).  Codepage 1252 (aka WinLatin1, ANSI) was also
       expanded to include the Euro currency symbol.
       It wouldn't be too unusual if they redefined Codepage 950 to Big5+, would it?
     - Twinbridge 4.98+ is the only product that supports Big5+ so far, according
       to Lunde's CJKV-IP book

     Debian and/or Linux-related:
     - Debian already has the xfntbig5p-cmex24m font, ultimately created from
       the data at CMEX
     - Debian doesn't have any input methods for Big5+, and software may be
       hardcoded for Big5 ranges (I know at least cxterm is flawed here)
     - Cangjie and Zhuyin (and even Pinyin) input methods could be created from
       the data at CMEX
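To make the "hardcoded for Big5 ranges" problem concrete, here is a rough Python sketch (not taken from cxterm or any other program). The lead-byte range given for Big5+ (0x81-0xFE) is my assumption and should be checked against the CMEX tables; split_chars and the sample bytes are invented for illustration.

```python
# Lead-byte ranges.  The Big5+ range is an assumption; verify against CMEX data.
BIG5_LEAD = range(0xA1, 0xFA)      # 0xA1..0xF9, classic Big5
BIG5PLUS_LEAD = range(0x81, 0xFF)  # 0x81..0xFE, assumed for Big5+

def split_chars(data, lead_range):
    """Split bytes into 1- or 2-byte units using the given lead-byte range."""
    units = []
    i = 0
    while i < len(data):
        if data[i] in lead_range and i + 1 < len(data):
            units.append(data[i:i + 2])  # lead byte plus its trail byte
            i += 2
        else:
            units.append(data[i:i + 1])  # treated as a single-byte character
            i += 1
    return units

# "A", then a pair whose lead byte lies outside classic Big5, then a Big5 pair.
sample = bytes([0x41, 0x81, 0x40, 0xA4, 0x40])

print(split_chars(sample, BIG5_LEAD))      # lead byte 0x81 is not recognized
print(split_chars(sample, BIG5PLUS_LEAD))  # 0x81 0x40 stays together
```

A program with the first range compiled in will cut the extended character in half, so the range probably wants to be a table or a setting rather than a constant.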

     - it'd be cool to be among the first OS's to support this fully

3) Big5 w/ GCCS
     General information:
     - GCCS (Government Chinese Character Set) is a 1995 standard from HKSAR
     - HK was buried in a mess of proprietary extensions to Big5, so GCCS was born
     - uses the user-defined regions of Big5 to add 3,049 characters for
       local use (names of places in HK), Cantonese dialectal characters,
       Japanese characters, and PRC simplified characters
     - ~1,500 of those 3,049 characters are not in Unicode
     - GCCS support required for products supplied to the government
     - Big5 w/ GCCS combination conflicts with codepoints for Big5+ :(
     - there is also an extension to GCCS by the HK Department of Judiciary (?)
     - ITSD http://www.info.gov.hk/gccs/ has fonts, Cangjie and Quick Cangjie data
     - DynaLab HK has info at http://www.dynalab.com.hk/font/gaigi.htm and sometimes
       interesting information in the News section at http://www.dynalab.com.hk/whatsnew.htm

     Microsoft and/or Windows-related: 
     - get the stuff from ITSD
     - Microsoft provides fonts for Pan-Chinese NT Workstation at

     Debian and/or Linux-related:
     - no support

     - we need this if we want Debian to do well in Hong Kong

4) Big5 w/ various proprietary extensions
     - We might want to provide fonts and input methods, such as for DynaLab HK A,
       Apple Daily online newspaper, HKUST, etc for legacy use.  But we should
       push those stragglers to GCCS.  We might also want to provide fonts and input
       methods for non-HK extensions like KuoChiao's, ETen's, etc.

5) GBK w/ GCCS
     General information:
     - just like 1) GBK, except the 1,500 characters in GCCS that are not in
       Unicode (and thus, GBK) are stuffed in there.

     Microsoft and/or Windows-related:
     - Microsoft provides fonts for Pan-Chinese NT Workstation at

     Debian and/or Linux-related:
     - no support

     - Does it matter?  I think Hong Kong is predominantly a Big5 world.

6) Big5 w/ Big5e
     General information:
     - Big5e (Big5 extension) is very new, 1999 (?)
     - extends Big5 with 3954 characters, all from CNS 11643 planes 3 and 4
     - not as extreme as Big5+, and it looks incompatible with Big5+
     - CMEX http://www.cmex.org.tw/ provides fonts, Cangjie/Zhuyin data,
       mapping tables

     Microsoft and/or Windows-related:
     - get the stuff from CMEX

     Debian and/or Linux-related:
     - no support

     - no opinion, except that it looks like Big5+ is really a better choice in
       the long run

7) CNS 11643
     General information:
     - ~50,000 characters
     - first two planes are almost synonymous with Big5
     - virtually dead because of Big5?
     - www.ifcss.org has a font from CBS

     Microsoft and/or Windows-related:
     - no support

     Debian and/or Linux-related:
     - planes 1 and 2 supported by cjk-latex and related packages
     - xemacs (mule) supports all seven planes, but input methods are lacking
     - no bundled fonts for all seven planes, although I suspect there
       is a 40x40 one in the figfonts-cjk package (for figlet, though); intlfonts-chinese
       omits them (I believe the same font) because of license uncertainty

     - it would be nice to have all those characters in X but may not be possible
       for technical reasons

8) CCCII
     General information:
     - ~70,000 characters
     - it or its ANSI subset cousin is used for bibliographic purposes
     - www.ifcss.org has two fonts, one BDF, and one in a proprietary (?) format;
       the former is a 64x64; the latter is created by a company called JOIN

     Microsoft and/or Windows-related:
     - some third-party library CJK catalog terminals (RLIN?)

     Debian and/or Linux-related:
     - no support

     - if anyone wants to use Linux to make a CJK library terminal...

Input Methods (IME/FEP)

IMEs should be separated from programs and shared:

- Different programs have different IME's, of differing levels of
  quality.  xcin has Zhuyin and Cangjie IME's that are in active
  development; cxterm has a lot of IME's but they are old and stale.
  yudit has only Cangjie and mule (in xemacs) has only ???.  Since yudit
  and mule are not put together by Chinese users (yudit by a Hungarian,
  mule by Japanese), their Chinese support is terrible.

- Multiple versions of IME's (the current situation) means they are not
  guaranteed to function the same way.  e.g., the Cangjie IME that comes
  with one program may have "commonly mistaken codes", while one that
  comes with another program might not.

- Data can be ported to more than one IME.  e.g., the same Mandarin
  pronunciation data can be used to generate Pinyin and Zhuyin IME's.
  (with small exceptions for mainland/Taiwan differences)
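As a rough illustration of the last point, the Python sketch below builds both a Pinyin and a Zhuyin lookup table from one set of readings. The data (READINGS), the three-entry syllable table, and the function name are all made up for the example; a real table would cover roughly 400 syllables and would have to handle the mainland/Taiwan differences.

```python
from collections import defaultdict

# Hypothetical pronunciation data: character -> Pinyin syllable with tone digit.
READINGS = {"中": "zhong1", "文": "wen2", "马": "ma3", "妈": "ma1"}

# Tiny illustrative Pinyin -> Zhuyin syllable table.
PINYIN_TO_ZHUYIN = {"zhong": "ㄓㄨㄥ", "wen": "ㄨㄣ", "ma": "ㄇㄚ"}

def build_tables(readings):
    """Return (pinyin_table, zhuyin_table), each mapping keystrokes -> candidates."""
    pinyin = defaultdict(list)
    zhuyin = defaultdict(list)
    for char, reading in readings.items():
        syllable = reading.rstrip("12345")  # drop the tone digit
        pinyin[syllable].append(char)
        zhuyin[PINYIN_TO_ZHUYIN[syllable]].append(char)
    return dict(pinyin), dict(zhuyin)

pinyin_table, zhuyin_table = build_tables(READINGS)
print(pinyin_table["ma"], zhuyin_table["ㄇㄚ"])
```

Both tables come from the same source, so a correction to one reading fixes both IMEs at once.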

Suggested packaging arrangement:

1) ime-zh-cn
   Contains IME's that a mainland China user would expect to have.
   e.g. Wubi, Wubizixing, Pinyin

2) ime-zh-tw
   Contains IME's that a Taiwan user would expect to have.
   e.g. Zhuyin, Cangjie

3) ime-fangyan
   Contains IME's for dialects and foreign languages.
     Cantonese
       de facto "Cantonese Pinyin"
       Sidney Lau - used in the 70's language primers
       Yale - used by linguists and foreign language textbooks
       Jyutping - promoted by LSHK
     Hakka
       Dr. Liu Zinfad's Hagfa
     Southern Min
       on yomi
4) ime-other
   Contains less commonly used IME's.  e.g., Wade-Giles for inputting
   Mandarin (some non-Chinese might still use this), dictionary
   indices ("Kangxi page 545, third character..."), 4 corner, etc.

For the naming, I avoided "gb" and "big5" because these may change
in the future (e.g., Unicode).  I also avoided "jianti" and "fanti"
because GBK, Big5+, and Unicode can do both jianti and fanti.
(GB = jianti and Big5 = fanti are both no longer true.)  I also avoided
listing "chinese" anywhere, because one can use a character set/encoding
for other languages.  e.g., a hypothetical ime-ja package with a Pinyin IME
for EUC-JP, for people in Japan (this already exists in some third-party
products for Windows).

Some comments on programs, ideas about things to package/fix/create

cedict{b5,gb} - Is there any way to download updates/additions to the dictionary
without downloading it all over again in its entirety?  i.e., patches.
Currently it is small (409K), but if it grows in the future like its
inspiration, the Japanese->English EDICT dictionary (I believe packaged
for Debian-JP), it could become very huge (EDICT is several megs at this

tcs - tcs is a character set/encoding converter from Plan9.  The Big5
support is for the erroneous "HKU standard"; this should be fixed.

xmbdfed - xmbdfed is a font editor by Mark Leisher.  It can handle HBF
fonts, but the Debian package does not include it because of licensing
problems with the HBF code (written by someone other than Leisher).

yudit - yudit is a Unicode-based multi-language editor.  Upstream author
needs help--he doesn't speak all those languages it supports.

xemacs (mule) - Chinese support could be better too.

Liu Zinfad's Hagfa input method - A Hakka input method.  Any Hakka
out there who'd use this?  Dylan Sung (in HK until July) has
information on it at http://ubik.virtual-pc.com/sapienti/hakintro.htm ,
but Prof. Liu himself would be a better source for upstream updates.

dates - Support for ROC year?
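
The arithmetic itself is simple (ROC year 1 is 1912), as in this Python sketch; the function names and the output format are only an illustration, not an existing API.

```python
import datetime

def roc_year(date):
    """Return the ROC (Minguo) year for a Gregorian date; year 1 is 1912."""
    if date.year < 1912:
        raise ValueError("the ROC era starts in 1912")
    return date.year - 1911

def format_roc(date):
    """Format a date with the ROC year, e.g. for display in a zh_TW locale."""
    return "民國%d年%d月%d日" % (roc_year(date), date.month, date.day)

print(format_roc(datetime.date(1999, 6, 19)))
```

The harder part is getting this into the locale data and date-handling programs rather than the conversion.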

chinalanguage.com - a nice web-based zidian by Thomas Chin.  Chin
provides his data for download; perhaps some of it might be appropriate
to package.

dynadoc - DynaDoc is a PDF-like format by DynaLab.  It's used by HKSAR
and perhaps also mainland China and Taiwan to publish government
and industry publications.  Supposedly it has built-in CJK support, including
GCCS.  A reader for this on Linux would be nice.

docs - Manuals, tutorials, etc not in Chinese, for Chinese language
students would be nice.  e.g., how to type in cangjie, pinyin, etc.

mtv - mtv is a VCD player.  Unofficial .deb's are available from
http://www.mpegtv.com/ .  Uses the XForms library.  I believe it would
have to go in "non-free".  Given the popularity of VCD's in Asia, it would be nice
if Debian could include one (there are questions on newsgroups about where
to get a VCD player for Linux).

tools - It'd be nice to have tools to help users create their own input
methods, and convert amongst xcin/cxterm/twinbridge/windows/etc formats.
Support for user-defined extensions (EUDC, "end-user defined fonts", "gaiji")
would be nice too.  Also conversion utilities, both strict (per codepoint),
and between traditional<->simplified cultural standards (Office 2000 does the
latter, according to the ads), would be nice.
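
For the strict (per codepoint) kind of conversion, a minimal Python sketch is below. It assumes Python's bundled "big5", "gbk" and "gb2312" codecs and simply goes through Unicode; the function name convert is made up. The traditional<->simplified conversion is a different problem that needs a mapping table, and is not shown.

```python
def convert(data, src, dst):
    """Re-encode bytes from src to dst, one code point at a time via Unicode.

    Raises UnicodeError if a character has no code point in the target.
    """
    return data.decode(src).encode(dst)

big5_bytes = "學".encode("big5")  # traditional form of "study"

# GBK contains the traditional form, so a strict conversion keeps it as is.
gbk_bytes = convert(big5_bytes, "big5", "gbk")
print(gbk_bytes.decode("gbk"))

# GB2312 only has the simplified form, so the strict conversion fails.
try:
    convert(big5_bytes, "big5", "gb2312")
except UnicodeEncodeError:
    print("not representable in gb2312")
```

This is also why "GB = jianti" no longer holds once GBK is in the picture.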

Hardcopy References

Lunde, Ken.  _CJKV Information Processing_.  O'Reilly.  Dec 1998.
  ISBN 1-56592-224-7.  (or his older edition, _Understanding Japanese
  Information Processing_ from 1993)

Meyer, Dirk.  "Dealing With Hong Kong Specific Characters".
  Multilingual, vol. 9, issue 3 (April 1998), pp. 35-38.

Kano, Nadine.  _Developing International Software for Windows 95 and
  Windows NT_.  I think this Win 95/Win NT 3.51 book is now out of print.

Unicode Consortium.  _The Unicode Standard, Version 2.0_.  Addison-Wesley.
  1996.  ISBN 0-201-48345-9.
