
On the collation of hanzi in the Chinese locales



Hello everyone. I am the author of the glibc 2.2 Chinese locale zh_CN, and I
am currently preparing a new revision, so I would like to ask for your
thoughts on hanzi collation. I know this is not a Debian-specific question,
but locales are the foundation of i18n and l10n and matter a great deal to
debian-chinese, and this list gathers Linux experts from the mainland, Taiwan,
Hong Kong, and overseas, so I am eager to hear your opinions. I hope you don't
mind this slightly off-topic post.

First, a very brief introduction to glibc 2.2, which is still in beta. The
most notable i18n-related changes are a rewritten locale system and the new
wide-character I/O functions. The new locale system is based on ISO 14652, the
localedef program supports multibyte characters, and locale definition files
are now independent of the charmap: there is a single zh_CN, and separate
zh_CN.GB2312, zh_CN.GBK, etc. definition files are no longer needed.

Collate is one of the locale categories; it defines the order of characters
when strings are compared. What we care about, of course, is the order of
hanzi, which generally means either radical+stroke order or pinyin order. All
earlier Chinese locale definitions, including hashao's zh_CN.GB2312, Chen
Xiangyang's zh_CN.GBK, and Tung-Han Hsieh's zh_TW.Big5, simply sort by
encoding, because older glibc had such limited support for Chinese that
collate was essentially unused. Now that we finally have a glibc with more or
less complete i18n support, it may be time to think seriously about hanzi
collation.

After much deliberation, I decided on radical+stroke ordering, for the
following reasons:

1. Hanzi in Unicode are arranged in radical+stroke order, so radical+stroke
   order is exactly Unicode code point order. The locale definition files now
   use Unicode code points throughout, so we need not list all 27000 hanzi
   one by one; the file stays concise and easy to maintain.
2. In the official Unicode data files, roughly 7000 hanzi have no pinyin.
   They are obscure characters or hanzi from Japan and Korea; assigning
   pinyin to them ourselves would be an enormous, nearly impossible job. This
   is my main reason for abandoning pinyin ordering. (If you know of a more
   complete and more authoritative mapping table, please tell me.)
3. Because many characters have multiple pronunciations, I doubt whether
   pinyin ordering is even meaningful. Unlike a dictionary, where a character
   may appear on several pages, here each character can occupy only one
   position.
4. Chinese currently has three locales, zh_CN, zh_TW, and zh_HK; given the
   regional and cultural differences, that is unavoidable. But in my humble
   opinion, wherever we can be the same, we should try hard not to differ.
   Radical+stroke ordering should be acceptable to everyone, giving us a
   shared collation. (I am glad to see that Tung-Han Hsieh adopted the same
   ordering in the new zh_TW.)
5. A stroke-ordered collate does not mean a program cannot sort strings by
   pinyin internally. Programmers can do anything! Only the strcoll functions
   use collate. You may never have used strcoll, but you have certainly used
   strcmp; think about it, isn't it used like this in the vast majority of
   cases:
                if (strcmp(str1, str2) == 0)
                    ...
                else
                    ...
   We only care whether the two strings are equal; we rarely need to put them
   in order. In this sense European languages need collate more, since
   otherwise there is no way to know, for example, how "ll" relates to "l"
   (in traditional Spanish collation "ll" is treated as a single letter). In
   Chinese, all distinct hanzi compare unequal, so whatever the order, it
   matters little. I do not think any particular ordering helps programming
   much, and end users never see collate, so they will not care either.
   If you think pinyin ordering brings some benefit to programming with
   simplified hanzi, e.g. easier to write and maintain, or significantly
   better performance, I would be glad to see an example.

This has grown too long, so I will stop here; I look forward to your opinions.
I also hope everyone will help test glibc 2.2 and the Chinese locales. Does
Debian have a beta/alpha/whatever release that includes glibc 2.2?

Finally, attached is my recent email exchange with hashao on this topic.
Thanks to hashao for the discussion, which made me think hard about many of
these issues.

Rigel





ha>   Regarding the zh_CN definition, could you make the hanzi part of
ha>the collate follow the pinyin order? Both ja_JP and ko_KR use their own
ha>collate instead of iso14651_t1. The iso14651_t1 template is pretty useless
ha>for hanzi.
ha>
ha>    A pinyin table and a script should do the job.
ha>
ha>ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/ has a
ha>Uni2Pinyin.gz file, but it is kinda old (1996).

   ri> The hanzi collation sequence of the glibc 2.2 zh_CN locale is in
   ri> stroke-count order. That's how CJK unified ideographs are arranged in
   ri> Unicode, if I'm not mistaken. For a while, I was torn between
   ri> stroke-count order and pinyin order while developing the locale.
   ri> Eventually the decision was made in favor of stroke-count order for
   ri> the following reasons:

   ri> 1. The mapping between hanzi and pinyin is incomplete in Unicode 3.0.
   ri>    By that I mean not every hanzi has a pinyin associated with it.
   ri>    Specifically, none of the 6582 newly added CJK unified ideographs
   ri>    in extension A has a pinyin. Even among the long-existing CJK
   ri>    unified ideographs, 558 do not have a pinyin either. The
   ri>    table (Uni2Pinyin) you mentioned does not help much here, since it
   ri>    was based on Unicode 1.1. I'm also reluctant to accept any ad hoc
   ri>    mapping tables, and prefer those from international or
   ri>    national standards bodies, or credible research institutes.

      ha> Fair enough. But I think we can make the pinyin table
      ha> complete. We can always update the data when official
      ha> standards are published.

         ri> Do you realize how BIG a job it is to assign pinyin to ~7000
         ri> obscure hanzi? And what do you suggest we do with the collation
         ri> BEFORE we have them all?

   ri> 2. A fair number of hanzi have multiple pronunciations. This will
   ri> certainly invite ambiguity and debate as to which one should be
   ri> picked as the primary pinyin.

      ha> Well, I would guess that most hanzi with multiple pronunciations
      ha> are frequently used ones. Frequently used hanzi are sorted in
      ha> GB2312 according to pinyin, and the pinyin for a hanzi with
      ha> multiple pronunciations is decided by its most frequently used
      ha> pronunciation. I find this solution clean and simple. So a map to
      ha> GB2312 can be used when there is ambiguity.

         ri> Well, I see the difference here is that you are satisfied with
         ri> "most", while I'm looking for a solution for "all". What do you
         ri> suggest we do if there is one case that cannot be mapped to
         ri> GB2312? What about 2 cases? 3 cases? ... and I can guarantee you
         ri> that there are plenty.

         ri> Another problem with multiple pronunciations is that
         ri> programmers cannot reliably depend on collation-based sorting,
         ri> because one cannot assume which pronunciation the user
         ri> intended. Going with the most frequent pronunciation is not a
         ri> solution, because sometimes the user might indeed be looking
         ri> for a rarely used pronunciation.

   ri> 3. Arranging hanzi in stroke-count order is very natural and has a
   ri> long history of acceptance in China. Since it also happens to
   ri> coincide with Unicode code value order, we can rank hanzi simply by
   ri> their Unicode value, which is another tradition started by ASCII.
   ri> This allows us to write a concise locale definition file, instead of
   ri> enumerating each and every code point (note, there are more than
   ri> 27000 of them).

      ha> The long acceptance of stroke-count order in China is related to
      ha> the fact that there was no pronunciation standard, nor a symbolic
      ha> system to represent pronunciation. (Okay, I don't know much about
      ha> it; just my personal impression.) Ever since we got a standard
      ha> pinyin system, we have been more familiar with pinyin ordering.
      ha> And it is natural for our language, I would say. Stroke count
      ha> surely looks natural for an ideograph, but when we think about
      ha> Chinese, we more likely think of how it sounds (well, except those
      ha> wubi input freaks, hehe). That is why most Chinese dictionaries
      ha> adopt a pinyin index. Yes, there are dictionaries, like ci2hai3,
      ha> that use stroke count as the main index, but the everyday Joe
      ha> won't use ci2hai3 that often.
      ha> 
      ha> When I look at a sorted list of hanzi, I find it quicker
      ha> to jump to a hanzi term by its pronunciation than by its
      ha> stroke count.

         ri> Whatever collation sequence we use does not prohibit sorting
         ri> hanzi strings in pinyin order. That is the programmer's choice
         ri> and has nothing to do with collation.

      ha> As for the ASCII analogy, I would think that is why there is a
      ha> collation table in the locale definition. Applications should use
      ha> the locale collation rules instead of code value order to sort
      ha> things. Speed and size are not problems, as today's hardware has
      ha> no trouble coping with it. (That reminds me of the 386 era, when
      ha> decoding a JPEG was such a CPU-hungry task that speed was the only
      ha> thing programs like qpeg and sea advertised. And people bought it!
      ha> Nowadays, no one cares how fast you can decode a JPEG file.) If
      ha> you were talking about the effort to make a Chinese collation
      ha> table with 27000 entries, then yes, it is much more work than just
      ha> using the Unicode code value scheme.
      
         ri> That is absolutely NOT true. Applications have every right to
         ri> use whatever rules they see fit to do the sorting. No one
         ri> forces you to use the locale collation. As a matter of fact,
         ri> among all the C library functions, only four use locale
         ri> collation: strcoll, strxfrm, wcscoll, and wcsxfrm, and the
         ri> latter two are simply the wide-character counterparts of the
         ri> first two.

   ri> 4. Pinyin order might not make much sense to people from Taiwan, at
   ri> least for now. Using two different collation sequences for hanzi in
   ri> zh_CN and zh_TW is, in my opinion, a disaster. It would make
   ri> programmers' lives miserable. Now that we have a chance to adopt the
   ri> same one, why let it escape?!

      ha> It is true that having both zh_CN and zh_TW is not very
      ha> applaudable. However, we have to face the reality that there are
      ha> cultural differences here: different monetary notation, different
      ha> calendar notation. As for stroke count, traditional hanzi and
      ha> simplified hanzi have totally different and sometimes inconsistent
      ha> stroke counts. Sorting simplified hanzi by stroke count makes
      ha> little sense to people from Taiwan anyway.

      ha> I read a thread on the CLE developers' list (at cle.linux.org.tw)
      ha> between the CLE developers and Ulrich Drepper discussing separate
      ha> sets for zh_CN and zh_TW. The conclusion was to keep two sets of
      ha> locales (zh_CN and zh_TW). As for making programmers' lives
      ha> miserable, well, we can have zh_CN.Big5 as well as zh_TW.GB2312.
      ha> Will that solve the problem?

      ha> Most mainland users are used to looking up hanzi by pinyin, right?

         ri> Yes, there are differences between zh_CN and zh_TW, though
         ri> probably not as big as you thought. Merging them into one would
         ri> be a crazy idea; I don't think anyone even hinted at it. You
         ri> are mistaken about the CLE thread. I believe the one you
         ri> referred to is the one in which Ulrich suggested that a single
         ri> libc PO file would be sufficient for both locales. That was
         ri> obviously wrong. However, I fail to see how it is relevant here!

         ri> You also misunderstood what I meant by having the same
         ri> collation sequence in both locales. We do not care whether a
         ri> hanzi is simplified or traditional. We simply count every hanzi
         ri> in Unicode and rank them accordingly. That covers all
         ri> characters from GBxxxx and Big5. The localedef program will
         ri> pick up the suitable characters when fed a specific charmap.
         ri> That's how it works in the new design of the glibc 2.2 locale
         ri> subsystem. So if two characters exist in both locales, they
         ri> will have the same relative order, and by definition collation
         ri> is only about relative order. In this sense we have the same
         ri> collation sequence.

   ri> As for iso14651, it's an international standard, part of what we
   ri> collectively call i18n standards. Our country cast a YES vote for
   ri> it last year. The template table certainly needs tailoring, but I
   ri> would not say it's useless.

      ha> That is a good point. Since our government voted YES, it might as
      ha> well become a mandatory standard some day. That, I don't know how
      ha> to deal with. I still like the pinyin method, but what are the
      ha> chances that iso14651 is adopted as a national standard?

      ha> Now I just wish there were some way to switch the collation in
      ha> use on the fly. Then we could have multiple collations for a
      ha> single locale.

         ri> No need to lose sleep over this. iso14651 does not enforce
         ri> any specific collation sequence.

   ri> I would like to hear more of your opinions, and maybe I should bring
   ri> this to a wider discussion to hear what other people say. Do you
   ri> mind if I post our exchange to some Chinese Linux development
   ri> lists, such as the debian-chinese mailing list (I know you are an
   ri> active member of that list)?

      ha> That is a good idea. Programs are written for people. More
      ha> discussion certainly gives a better chance of leading to better
      ha> solutions. Go for it.





