
Re: some thoughts



|Some thoughts on Debian, Linux and Chinese, 1999.6.19:
|
|Character Set Support
|---------------------
|
|1) GBK (aka GB 13000.1):
|     General information:
|     - 1993 standard from mainland China which extends GB2312 (1980)
|     - same ~20,000 character repertoire as Unicode, as they were
|       both developed in cooperation, according to the Unicode 2.0 book
|     - codepoint for codepoint backwards compatible with GB2312 (in EUC-CN encoding)
|     - one can write "Rong" in Premier Zhu Rongji's name with it :)
|       e.g. http://www.newchinapicture.com/company/newchinapicture.com.cn/shizheng.html
|
|     Microsoft and/or Windows-related:
|     - Microsoft upgraded the definition of their Codepage 936 from GB2312 to GBK
|       for the mainland China version of Win 95
|     - GBK is also the basis for Pan-Chinese NT Workstation, released in
|       Hong Kong, which is the mainland Chinese version of Win NT Workstation 4.0
|       but with an English interface (?), and supports simplified and traditional
|       Chinese because of that huge repertoire of characters
|     - GBK fonts "MS Hei" and "MS Song" can be gotten from the Internet
|       Explorer (ie31pkcn.exe) and Office 97 add-on packs (chssupp.exe)
|     - third-party CJK-enabling add-on products for Windows seem to have started
|       providing GBK support around 1997/1998 (?)
|
|     Debian and/or Linux-related:
|     - Debian only has GB2312 fonts, and the GB input methods are probably
|       only geared towards GB2312; existing software may be hardcoded to GB2312
|       coderanges
|     - a GBK font could probably be made by remapping the 24x24 Big5+ font in the
|       xfntbig5p-cmex24m package (stable/main/binary-i386/x11), but mainland
|       China (gov't) is picky about glyph design, so we should get a proper font
|       and input methods, preferably one that is approved with a certificate
|       like ??? (I found one on the www.dynalab.com.hk site before, but can't find it now)
|
|     Conclusions:
|     - we're behind the world in this area! (high priority)

IMHO the major problem for GBK support is the lack of fonts (actually,
this is true of any encoding). Unfortunately this problem is very
difficult to solve, because DFSG-free fonts are very rare and most people
lack the skill to make them. Software support for GBK should not be
difficult to fix, though.
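
As a rough illustration of the software side, here is a minimal sketch
that checks the two GBK properties listed above: GB2312 text keeps the
same bytes under GBK, and the "Rong" of Zhu Rongji only fits in GBK. It
assumes Python 3 with its bundled "gb2312" and "gbk" codecs; the helper
name `encodable` is made up for this example.

```python
def encodable(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

common = "\u4e2d\u6587"   # two characters that are present in GB2312
rong = "\u9555"           # the "Rong" in Zhu Rongji's name

# GBK is meant to keep the GB2312 (EUC-CN) byte values unchanged.
same_bytes = common.encode("gb2312") == common.encode("gbk")

print("GB2312 bytes preserved under GBK:", same_bytes)
print("Rong encodable in GB2312:", encodable(rong, "gb2312"))
print("Rong encodable in GBK:", encodable(rong, "gbk"))
```

A program that hardcodes GB2312 ranges would need the same kind of check
replaced by a GBK-aware one; fonts remain a separate problem.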

|2) Big5+:
|     General information:
|     - 1997 standard from Taiwan which extends Big5 (1984)
|     - expands Big5 to the same ~20,000 character repertoire as Unicode
|     - codepoint for codepoint backwards compatible with Big5
|     - CMEX http://www.cmex.org.tw/ has font, Cangjie/Zhuyin data, CNS 11643/GBK/Unicode
|       mapping tables
|
|     Microsoft and/or Windows-related:
|     - the definition of Codepage 950 is still Big5, but history has shown
|       that Microsoft has expanded their codepages with backwards-compatible
|       supersets.  e.g. Codepage 936 from GB2312 -> GBK and Codepage 949 from
|       KS C 5601 in EUC-KR to an expanded version, but creation of Codepage 1361 for
|       KS C 5601 in Johab encoding, which is not compatible.  Also the expansion
|       of Codepage 1252 (aka Winlatin1, ANSI) to include the Euro currency symbol.
|       It wouldn't be too unusual if they redefined Codepage 950 to Big5+, would it?
|     - Twinbridge 4.98+ is the only product that supports Big5+ so far, according
|       to Lunde's CJKV-IP book
|
|     Debian and/or Linux-related:
|       - Debian already has the xfntbig5p-xcmex24m font, ultimately created from
|         the data at CMEX
|       - Debian doesn't have any input methods for Big5+, and software may be
|         hardcoded for Big5 ranges (I know at least cxterm is flawed here)
|       - Cangjie and Zhuyin (and even Pinyin) input methods could be created from
|         the data at CMEX
|
|     Conclusions:         
|     - it'd be cool to be among the first OS's to support this fully

I think we should pursue this. Do you know what kinds of characters
have been added to Big5+? (Sigh, all the documents from CMEX are in
Word 7.0...)
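
To show what "hardcoded for Big5 ranges" means in practice, here is a
minimal sketch of a double-byte check. The byte ranges are assumptions
taken from common descriptions of Big5 (lead 0xA1-0xF9) and Big5+ (lead
0x81-0xFE, trail extended down to 0x80); they should be verified against
the CMEX tables. The function names are hypothetical.

```python
# Assumed ranges; check them against the CMEX mapping tables.
BIG5_LEAD = range(0xA1, 0xFA)       # 0xA1-0xF9
BIG5PLUS_LEAD = range(0x81, 0xFF)   # 0x81-0xFE

def is_trail(byte, plus=False):
    """Return True if `byte` is a valid second byte."""
    if 0x40 <= byte <= 0x7E:
        return True
    low = 0x80 if plus else 0xA1
    return low <= byte <= 0xFE

def is_double_byte(lead, trail, plus=False):
    """Return True if (lead, trail) forms a two-byte character."""
    leads = BIG5PLUS_LEAD if plus else BIG5_LEAD
    return lead in leads and is_trail(trail, plus)

# A pair in the extended area is rejected by the plain Big5 check.
print(is_double_byte(0x81, 0x40))             # plain Big5 check
print(is_double_byte(0x81, 0x40, plus=True))  # Big5+ check
```

Software that only knows the first set of ranges would treat the extended
pair as two stray bytes, which is presumably the kind of flaw seen in
cxterm.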

|
|3) Big5 w/ GCCS
|     General information:
|     - GCCS (Government Chinese Character Set) is a 1995 standard from HKSAR
|     - HK was buried in a mess of proprietary extensions to Big5, so GCCS was born
|     - uses the user-defined regions of Big5 to add 3,049 characters for
|       local use (names of places in HK), Cantonese dialectal characters,
|       Japanese characters, and PRC simplified characters
|     - ~1,500 of those 3,049 characters are not in Unicode
|     - GCCS support required for products supplied to the government
|     - Big5 w/ GCCS combination conflicts with codepoints for Big5+ :(
|     - there is also an extension to GCCS by the HK Department of Judiciary (?)
|     - ITSD http://www.info.gov.hk/gccs/ has fonts, Cangjie and Quick Cangjie data
|     - DynaLab HK has info at http://www.dynalab.com.hk/font/gaigi.htm and sometimes
|       interesting information in the News section at http://www.dynalab.com.hk/whatsnew.htm
|
|     Microsoft and/or Windows-related: 
|     - get the stuff from ITSD
|     - Microsoft provides fonts for Pan-Chinese NT Workstation at
|       http://microsoft.com/hk/pcntw/html/extras.htm  
|
|     Debian and/or Linux-related:
|     - no support
|
|     Conclusions:
|     - we need this if we want Debian to do well in Hong Kong

Agreed (as I'm a Hongkonger :)

|
|4) Big5 w/ various proprietary extensions
|     - We might want to provide fonts and input methods, such as for DynaLab HK A,
|       Apple Daily online newspaper, HKUST, etc for legacy use.  But we should
|       push those stragglers to GCCS.  We might also want to provide fonts and input
|       methods for non-HK extensions like KuoChiao's, ETen's, etc.
|
|5) GBK w/ GCCS
|     General information:
|     - just like 1) GBK, except the 1,500 characters in GCCS that are not in
|       Unicode (and thus, GBK) are stuffed in there.
|
|     Microsoft and/or Windows-related:
|     - Microsoft provides fonts for Pan-Chinese NT Workstation at
|       http://microsoft.com/hk/pcntw/html/extras.htm  
|
|     Debian and/or Linux-related
|     - no support
|
|     Conclusions:
|     - Does it matter?  I think Hong Kong is predominantly a Big5 world.

Yes, Hongkong uses Big5 only.

|6) Big5 w/ Big5e
|     General information:
|     - Big5e (Big5 extension) is very new, 1999 (?)
|     - extends Big5 with 3954 characters, all from CNS 11643 planes 3 and 4
|     - not as extreme as Big5+, and it looks incompatible with Big5+
|     - CMEX http://www.cmex.org.tw/ provides fonts, Cangjie/Zhuyin data,
|       mapping tables
|
|     Microsoft and/or Windows-related:
|     - get the stuff from CMEX
|
|     Debian and/or Linux-related
|     - no support
|
|     Conclusions:
|     - no opinion, except that it looks like Big5+ is really a better choice in
|       the long run
|
|7) CNS 11643
|     General information:
|     - ~50,000 characters
|     - first two planes are almost synonymous with Big5
|     - virtually dead because of Big5?
|     - www.ifcss.org has a font from CBS
|       http://www.ifcss.org/ftp-pub/software/fonts/cns/
|
|     Microsoft and/or Windows-related:
|     - no support
|
|     Debian and/or Linux-related:
|     - planes 1 and 2 supported by cjk-latex and related packages
|     - xemacs (mule) supports all seven planes, but input methods are lacking
|     - no bundled fonts for all seven planes, although I suspect there
|       is a 40x40 one in the figfonts-cjk package (for figlet, though); intlfonts-chinese
|       omits them (I believe the same font) because of license uncertainty
|
|     Conclusions:
|     - it would be nice to have all those characters in X but may not be possible
|       for technical reasons

I have heard that those fonts have some copyright issues that prevent
us from putting them in Debian. I believe that the fonts in figfonts-cjk
just 'slipped through the holes' and no one noticed (or no one cares).

|
|8) CCCII
|     General information:
|     - ~70,000 characters
|     - it or its ANSI subset cousin is used for bibliographic purposes
|     - www.ifcss.org has two fonts, one BDF, and one in a proprietary (?) format;
|       the former is a 64x64; the latter is created by a company called JOIN
|       http://www.ifcss.org/ftp-pub/software/fonts/misc/bdf/ 
|
|     Microsoft-related:
|     - some third-party library CJK catalog terminals (RLIN?)
|
|     Debian and/or Linux-related
|     - no support
|
|     Conclusions:
|     - if anyone wants to use Linux to make a CJK library terminal...

I'm not sure about this, as I seldom come across CCCII. This can be of
lower priority.

|
|Input Methods (IME/FEP)
|----------------------
|
|IMEs should be separated from programs and shared:
|
|- Different programs have different IME's, of differing levels of
|  quality.  xcin has Zhuyin and Cangjie IME's that are in active
|  development; cxterm has a lot of IME's but they are old and stale.
|  yudit has only Cangjie and mule (in xemacs) has only ???.  Since yudit
|  and mule are not put together by Chinese users (yudit by a Hungarian,
|  mule by Japanese), their Chinese support is terrible.
|
|- Multiple versions of IME's (the current situation) means they are not
|  guaranteed to function the same way.  e.g., the Cangjie IME that comes
|  with one program may have "commonly mistaken codes", while one that
|  comes with another program might not.
|
|- Data can be ported to more than one IME.  e.g., the same Mandarin
|  pronunciation data can be used to generate Pinyin and Zhuyin IME's.
|  (with small exceptions for mainland/taiwan differences)
|
|Suggested packaging arrangement:
|
|1) ime-zh-cn
|   Contains IME's that a mainland China user would expect to have.
|   e.g. Wubi, Wubizixing, Pinyin
|
|2) ime-zh-tw
|   Contains IME's that a Taiwan user would expect to have.
|   e.g. Zhuyin, Cangjie
|
|3) ime-fangyan                        
|   Contains IME's for dialects and foreign languages.
|   e.g.
|     Cantonese
|       de facto "Cantonese Pinyin"
|       Sidney Lau - used in the 70's language primers
|       Yale - used by linguists and foreign language textbooks
|       Jyutping - promoted by LSHK
|     Hakka
|       Dr. Liu Zinfad's Hagfa
|     Southern Min
|       ???
|     Japanese
|       on yomi
|
|4) ime-other
|   Contains less commonly used IME's.  e.g., Wade-Giles for inputting
|   Mandarin (some non-Chinese might still use this), dictionary
|   indices ("Kangxi page 545, third character..."), 4 corner, etc.                        
|
|For the naming, I avoided "gb" and "big5" because these may change
|in the future (e.g., unicode).  I also avoided "jianti" and "fanti"
|because GBK, Big5+, and Unicode can do both jianti and fanti.
|(GB = jianti and Big5 = fanti are both no longer true.)  I also avoided
|listing "chinese" anywhere, because one can use a character set/encoding
|for other languages.  e.g., a hypothetical ime-ja package with a Pinyin IME
|for EUC-JP, for people in Japan (this already exists in some third-party
|products for Windows).

However, the fact is that the data formats of the input methods that
different programs use are not the same. Cxterm uses its own format,
and xcin uses another one. The best thing we can do is to set up a
centralized repository of 'raw' input method data, and suggest that
authors of Chinese software, like cxterm and xcin, refer to our repository.
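
A minimal sketch of how one 'raw' table could feed more than one input
method: the same reading data is turned into a Pinyin lookup and a Zhuyin
lookup. The data layout, the tiny sample table, and the names `RAW` and
`build_tables` are all assumptions for illustration; neither cxterm nor
xcin uses this format.

```python
from collections import defaultdict

# Hypothetical raw data: character -> Mandarin syllable plus tone digit.
RAW = {"中": "zhong1", "文": "wen2", "妈": "ma1", "马": "ma3"}

# Only the syllables needed for this sample; a real table is much larger.
PINYIN_TO_ZHUYIN = {"zhong": "ㄓㄨㄥ", "wen": "ㄨㄣ", "ma": "ㄇㄚ"}
TONE_MARKS = {"1": "", "2": "ˊ", "3": "ˇ", "4": "ˋ"}

def build_tables(raw):
    """Return (pinyin_table, zhuyin_table), each mapping key -> characters."""
    pinyin = defaultdict(list)
    zhuyin = defaultdict(list)
    for char, reading in raw.items():
        syllable, tone = reading[:-1], reading[-1]
        pinyin[syllable].append(char)  # toneless Pinyin key
        zhuyin[PINYIN_TO_ZHUYIN[syllable] + TONE_MARKS[tone]].append(char)
    return dict(pinyin), dict(zhuyin)

pinyin_table, zhuyin_table = build_tables(RAW)
print(pinyin_table["ma"])
print(zhuyin_table["ㄇㄚˇ"])
```

Each program would then only need a small exporter from the raw table to
its own file format, instead of maintaining its own copy of the data.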

|
|Some comments on programs, ideas about things to package/fix/create
|-------------------------------------------------------------------
|
|cedict{b5,gb} - Any way to download updates/additions to the dictionary
|without downloading it all over again in entirety?  i.e., patches.
|Currently it is small (409K), but if it grows in the future like its
|inspiration, the Japanese->English EDICT dictionary (I believe packaged
|for Debian-JP), it could become very huge (EDICT is several megs at this
|point).

This is a very interesting point, and it applies not only to cedict but
also to any other large package, such as xfonts. It would mean modifying
Debian's current upload/download and FTP infrastructure, but it seems
very useful. It will take some time to accomplish.
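
For a line-oriented file like a dictionary, the patch idea can be
sketched with Python's standard difflib module: compute only the changed
line ranges between two versions and replay them on the old copy. The
sample entries and the names `make_patch` and `apply_patch` are made up
for this example; this is not how the Debian archive works today.

```python
import difflib

def make_patch(old, new):
    """Return the non-equal edit operations that turn `old` into `new`."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_patch(old, patch):
    """Apply a patch from make_patch() to a copy of `old`."""
    result = list(old)
    # Work from the end so that earlier indices stay valid.
    for i1, i2, lines in reversed(patch):
        result[i1:i2] = lines
    return result

# Hypothetical dictionary entries, one per line.
old = ["a1 /first/", "b2 /second/", "c3 /third/"]
new = ["a1 /first/", "b2 /second, revised/", "c3 /third/", "d4 /fourth/"]

patch = make_patch(old, new)
print(len(patch), "changed ranges instead of", len(new), "lines")
print(apply_patch(old, patch) == new)
```

Only the patch would need to be downloaded, which stays small as long as
updates touch a small part of the file.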

|tcs - tcs is a character set/encoding converter from Plan9.  The Big5
|support is for the erroneous "HKU standard"; this should be fixed.

I tried tcs before. It had problems converting GB to and from Big5. I
hope I did nothing wrong at that time. Have you tried ccf? It's not
packaged, probably due to a license problem, but it does the conversion
very well. I frequently use ccf to do GB/Big5/HZ conversion.
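
For comparison, a strict per-codepoint GB/Big5/HZ conversion can be
sketched by going through Unicode. This assumes Python 3 and its bundled
"gb2312", "big5" and "hz" codecs; it is not how ccf or tcs work
internally, and the name `convert` is made up for this example.

```python
def convert(data, src, dst):
    """Strict conversion: decode bytes from `src`, re-encode as `dst`."""
    return data.decode(src).encode(dst)

text = "\u4e2d\u6587"   # two characters that exist in both GB2312 and Big5
gb = text.encode("gb2312")

big5 = convert(gb, "gb2312", "big5")
hz = convert(gb, "gb2312", "hz")
back = convert(hz, "hz", "gb2312")

print(big5 == text.encode("big5"))
print(back == gb)
```

A character with no counterpart in the target set makes the encode step
raise UnicodeEncodeError, so this approach does not map simplified forms
to traditional ones; that needs a separate mapping table.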

|
|xmbdfed - xmbdfed is a font editor by Mark Leisher.  It can handle HBF
|fonts, but the Debian package does not include it because of licensing
|problems with the HBF code (written by someone other than Leisher).

Do you know of any way to convert between HBF and BDF? If not, then
maybe we should patch xmbdfed to support HBF. As the HBF format is
open, I think it can be done.

|
|yudit - yudit is a Unicode-based multi-language editor.  Upstream author
|needs help--he doesn't speak all those languages it supports.

Wow, the authors of yudit are quite capable. 'Helping yudit' should be a
wishlist item in our TODO list.

|
|xemacs (mule) - Chinese support could be better too.

Can you be more specific? I don't use mule so I have no idea...

|
|Liu Zinfad's Hagfa input method - A Hakka input method.  Any Hakka
|out there who'd use this?  Dylan Sung (in HK until July) has
|information on it at http://ubik.virtual-pc.com/sapienti/hakintro.htm ,
|but Prof. Liu himself would be a better source for upstream updates.

haha, great, another wishlist item.

|dates - Support for ROC year?

May not be necessary; as far as I can see, the demand is not very large.
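
For reference, the arithmetic itself is small. A minimal sketch, assuming
the usual convention that ROC year 1 is 1912; the function name is made
up for this example.

```python
def to_roc_year(gregorian_year):
    """Convert a Gregorian year to an ROC (Minguo) year."""
    if gregorian_year < 1912:
        raise ValueError("ROC years start in 1912")
    return gregorian_year - 1911

print(to_roc_year(1999))
```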

|chinalanguage.com - a nice web-based zidian by Thomas Chin.  Chin
|provides his data for download; perhaps some of it might be appropriate
|to package.

I have taken a look and there may be something that we can use. I will
pay a few more visits to see what is actually useful.

|
|dynadoc - DynaDoc is a PDF-like format by DynaLab.  It's used by HKSAR
|and perhaps also mainland China and Taiwan to publish government
|and industry publications.  Supposedly it has built-in CJK support, including
|GCCS.  A reader for this on Linux would be nice.
|http://www.dynalab.com.hk/internet/index.htm

I can't find a specification of the DynaDoc format on DynaLab's website.
This makes it very difficult for us to create a Linux reader, because we
would have to reverse engineer the DynaDoc format.


|docs - Manuals, tutorials, etc not in Chinese, for Chinese language
|students would be nice.  e.g., how to type in cangjie, pinyin, etc.

One more wishlist item :)

|mtv - mtv is a VCD player.  Unofficial .deb's are available from
|http://www.mpegtv.com/ .  Uses the XForms library.  I believe it would
|be a "non-free".  Given the popularity of VCD's in Asia, it would be nice
|if Debian could include one (there are questions on newsgroups about where
|to get a VCD player for Linux).

Yes, VCD is popular. However, their homepage doesn't say whether we
can redistribute the program. I need to send them an email for
clarification. It would be cool if we had a VCD player :)

|
|tools - It'd be nice to have tools to help users create their own input
|methods, and convert amongst xcin/cxterm/twinbridge/windows/etc formats.
|Support for user-defined extensions (EUDC, "end-user defined fonts", "gaiji")
|would be nice too.  Also conversion utilities, both strict (per codepoint),
|and between traditional<->simplified cultural standards (Office 2000 does the
|latter, according to the ads), would be nice.

It would be nice for our users if we had those input method tools.
As to the conversion utilities you mentioned, are you talking about
tools like tcs that convert between different encodings?

|
|Hardcopy References
|-------------------
|
|Lunde, Ken.  _CJKV Information Processing_.  O'Reilly.  Dec 1998.
|  ISBN 1-56592-224-7.  (or his older edition, _Understanding Japanese
|  Information Processing_ from 1993)

I like this book indeed, I learned a lot from it.

|
|Meyer, Dirk.  "Dealing With Hong Kong Specific Characters".
|  Multilingual, vol. 9, issue 3 (April 1998), pp. 35-38.
|
|Nadine Kano's _Developing International Software for Windows 95 and
|  Windows NT_.  I think this Win 95/Win NT 3.51 book is now out of print.
|
|Unicode Consortium.  _The Unicode Standard, Version 2.0_.  Addison-Wesley.
|  1996.  ISBN 0-201-48345-9.

-- 
Anthony Wong.   [ E-mail: hajime@asunaro.dhs.org / ypwong@debian.org ]

