
On the collation of hanzi in the Chinese locales



Hello everyone. I am the author of the glibc 2.2 Chinese locale zh_CN, and I
am currently preparing a new revision, so I would like to ask for your
thoughts on hanzi collation. I know this is not a Debian-specific question,
but locales are the foundation of i18n and l10n and matter a great deal to
debian-chinese, and this list gathers Linux experts from the mainland, Taiwan,
Hong Kong, and overseas, so I am eager to hear your opinions. I hope you don't
mind this slightly off-topic post.

First, a very brief introduction to glibc 2.2, which is still in beta. The
most notable i18n-related changes are a rewritten locale system and the new
wide-character I/O functions. The new locale system is based on ISO 14652, the
localedef program supports multibyte characters, and locale definition files
are now independent of the charmap: there is a single zh_CN, and separate
zh_CN.GB2312, zh_CN.GBK, etc. definition files are no longer needed.

Collate is one of the locale categories; it defines the order of characters
when strings are compared. What we care about, of course, is the order of
hanzi, which generally means either radical+stroke order or pinyin order. All
earlier Chinese locale definitions, including hashao's zh_CN.GB2312, Chen
Xiangyang's zh_CN.GBK, and Tung-Han Hsieh's zh_TW.Big5, simply sort by
encoding, because older glibc had such limited support for Chinese that
collate was essentially unused. Now that we finally have a glibc with more or
less complete i18n support, it may be time to think seriously about hanzi
collation.

After much deliberation, I decided on radical+stroke ordering, for the
following reasons:

1. Hanzi in Unicode are arranged in radical+stroke order, so radical+stroke
   order is exactly Unicode code point order. The locale definition files now
   use Unicode code points throughout, so we need not list all 27000 hanzi
   one by one; the file stays concise and easy to maintain.
2. In the official Unicode data files, roughly 7000 hanzi have no pinyin.
   They are obscure characters or hanzi from Japan and Korea; assigning
   pinyin to them ourselves would be an enormous, nearly impossible job. This
   is my main reason for abandoning pinyin ordering. (If you know of a more
   complete and more authoritative mapping table, please tell me.)
3. Because many characters have multiple pronunciations, I doubt whether
   pinyin ordering is even meaningful. Unlike a dictionary, where a character
   may appear on several pages, here each character can occupy only one
   position.
4. Chinese currently has three locales, zh_CN, zh_TW, and zh_HK; given the
   regional and cultural differences, that is unavoidable. But in my humble
   opinion, wherever we can be the same, we should try hard not to differ.
   Radical+stroke ordering should be acceptable to everyone, giving us a
   shared collation. (I am glad to see that Tung-Han Hsieh adopted the same
   ordering in the new zh_TW.)
5. A stroke-ordered collate does not mean a program cannot sort strings by
   pinyin internally. Programmers can do anything! Only the strcoll functions
   use collate. You may never have used strcoll, but you have certainly used
   strcmp; think about it, isn't it used like this in the vast majority of
   cases:
                if (strcmp(str1, str2) == 0)
                    ...
                else
                    ...
   We only care whether the two strings are equal; we rarely need to put them
   in order. In this sense European languages need collate more, since
   otherwise there is no way to know, for example, how "ll" relates to "l"
   (in traditional Spanish collation "ll" is treated as a single letter). In
   Chinese, all distinct hanzi compare unequal, so whatever the order, it
   matters little. I do not think any particular ordering helps programming
   much, and end users never see collate, so they will not care either.
   If you think pinyin ordering brings some benefit to programming with
   simplified hanzi, e.g. easier to write and maintain, or significantly
   better performance, I would be glad to see an example.

This has grown too long, so I will stop here; I look forward to your opinions.
I also hope everyone will help test glibc 2.2 and the Chinese locales. Does
Debian have a beta/alpha/whatever release that includes glibc 2.2?

Finally, attached is my recent email exchange with hashao on this topic.
Thanks to hashao for the discussion, which made me think hard about many of
these issues.

Rigel





ha>   Regarding the zh_CN definition, could you make the hanzi part of
ha>the collate follow the pinyin order? Both ja_JP and ko_KR use their own
ha>collate instead of iso14651_t1. The iso14651_t1 template is pretty useless
ha>for hanzi.
ha>
ha>    A pinyin table and a script should do the job.
ha>
ha>ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/ has a
ha>Uni2Pinyin.gz file, but it is kinda old (1996).

   ri> The hanzi collation sequence of the glibc 2.2 zh_CN locale is in
   ri> stroke-count order. That's how CJK unified ideographs are arranged in
   ri> Unicode, if I'm not mistaken. For a while, I was torn between
   ri> stroke-count order and pinyin order while developing the locale.
   ri> Eventually the decision was made in favor of stroke-count order for
   ri> the following reasons:

   ri> 1. The mapping between hanzi and pinyin is incomplete in Unicode 3.0.
   ri>    By that I mean not every hanzi has a pinyin associated with it.
   ri>    Specifically, none of the 6582 newly added CJK unified ideographs
   ri>    in extension A has a pinyin. Even among the long-existing CJK
   ri>    unified ideographs, 558 do not have a pinyin either. The
   ri>    table (Uni2Pinyin) you mentioned does not help much here, since it
   ri>    was based on Unicode 1.1. I'm also reluctant to accept any ad hoc
   ri>    mapping tables, and prefer those from international or
   ri>    national standards bodies, or credible research institutes.

      ha> Fair enough. But I think we can make the pinyin table
      ha> complete. We can always update the data when official
      ha> standards are published.

         ri> Do you realize how BIG a job it is to assign pinyin to ~7000
         ri> obscure hanzi? And what do you suggest we do with the collation
         ri> BEFORE we have them all?

   ri> 2. A fair number of hanzi have multiple pronunciations. This will
   ri> certainly invite ambiguity and debate as to which one should be
   ri> picked as the primary pinyin.

      ha> Well, I would guess that most hanzi with multiple pronunciations
      ha> are frequently used ones. Frequently used hanzi are sorted in
      ha> GB2312 according to pinyin, and the pinyin for a hanzi with
      ha> multiple pronunciations is decided by its most frequently used
      ha> pronunciation. I find this solution clean and simple. So a map to
      ha> GB2312 can be used when there is ambiguity.

         ri> Well, I see the difference here is that you are satisfied with
         ri> "most", while I'm looking for a solution for "all". What do you
         ri> suggest we do if there is one case that cannot be mapped to
         ri> GB2312? What about 2 cases? 3 cases? ... and I can guarantee you
         ri> that there are plenty.

         ri> Another problem with multiple pronunciations is that
         ri> programmers cannot reliably depend on collation-based sorting,
         ri> because one cannot assume which pronunciation the user
         ri> intended. Going with the most frequent pronunciation is not a
         ri> solution, because sometimes the user might indeed be looking
         ri> for a rarely used pronunciation.

   ri> 3. Arranging hanzi in stroke-count order is very natural and has a
   ri> long history of acceptance in China. Since it also happens to
   ri> coincide with Unicode code value order, we can rank hanzi simply by
   ri> their Unicode value, which is another tradition started by ASCII.
   ri> This allows us to write a concise locale definition file, instead of
   ri> enumerating each and every code point (note, there are more than
   ri> 27000 of them).

      ha> The long acceptance of stroke-count order in China is related to
      ha> the fact that there was no pronunciation standard, nor a symbolic
      ha> system to represent pronunciation. (Okay, I don't know much about
      ha> it; just my personal impression.) Ever since we got a standard
      ha> pinyin system, we have been more familiar with pinyin ordering.
      ha> And it is natural for our language, I would say. Stroke count
      ha> surely looks natural for an ideograph, but when we think about
      ha> Chinese, we more likely think of how it sounds (well, except those
      ha> wubi input freaks, hehe). That is why most Chinese dictionaries
      ha> adopt a pinyin index. Yes, there are dictionaries, like ci2hai3,
      ha> that use stroke count as the main index, but the everyday Joe
      ha> won't use ci2hai3 that often.
      ha> 
      ha> When I look at a sorted list of hanzi, I find it quicker
      ha> to jump to a hanzi term by its pronunciation than by its
      ha> stroke count.

         ri> Whatever collation sequence we use does not prohibit sorting
         ri> hanzi strings in pinyin order. That is the programmer's choice
         ri> and has nothing to do with collation.

      ha> As for the ASCII analogy, I would think that is why there is a
      ha> collation table in the locale definition. Applications should use
      ha> the locale collation rules instead of code value order to sort
      ha> things. Speed and size are not problems, as today's hardware has
      ha> no trouble coping with it. (That reminds me of the 386 era, when
      ha> decoding a JPEG was such a CPU-hungry task that speed was the only
      ha> thing programs like qpeg and sea advertised. And people bought it!
      ha> Nowadays, no one cares how fast you can decode a JPEG file.) If
      ha> you were talking about the effort to make a Chinese collation
      ha> table with 27000 entries, then yes, it is much more work than just
      ha> using the Unicode code value scheme.
      
         ri> That is absolutely NOT true. Applications have every right to
         ri> use whatever rules they see fit to do the sorting. No one
         ri> forces you to use the locale collation. As a matter of fact,
         ri> among all the C library functions, only four use locale
         ri> collation: strcoll, strxfrm, wcscoll, and wcsxfrm, and the
         ri> latter two are simply the wide-character counterparts of the
         ri> first two.

   ri> 4. Pinyin order might not make much sense to people from Taiwan, at
   ri> least for now. Using two different collation sequences for hanzi in
   ri> zh_CN and zh_TW is, in my opinion, a disaster. It would make
   ri> programmers' lives miserable. Now that we have a chance to adopt the
   ri> same one, why let it escape?!

      ha> It is true that having both zh_CN and zh_TW is not very
      ha> applaudable. However, we have to face the reality that there are
      ha> cultural differences here: different monetary notation, different
      ha> calendar notation. As for stroke count, traditional hanzi and
      ha> simplified hanzi have totally different and sometimes inconsistent
      ha> stroke counts. Sorting simplified hanzi by stroke count makes
      ha> little sense to people from Taiwan anyway.

      ha> I read a thread on the CLE developers' list (at cle.linux.org.tw)
      ha> between the CLE developers and Ulrich Drepper discussing separate
      ha> sets for zh_CN and zh_TW. The conclusion was to keep two sets of
      ha> locales (zh_CN and zh_TW). As for making programmers' lives
      ha> miserable, well, we can have zh_CN.Big5 as well as zh_TW.GB2312.
      ha> Will that solve the problem?

      ha> Most mainland users are used to looking up hanzi by pinyin, right?

         ri> Yes, there are differences between zh_CN and zh_TW, though
         ri> probably not as big as you thought. Merging them into one would
         ri> be a crazy idea; I don't think anyone even hinted at it. You
         ri> are mistaken about the CLE thread. I believe the one you
         ri> referred to is the one in which Ulrich suggested that a single
         ri> libc PO file would be sufficient for both locales. That was
         ri> obviously wrong. However, I fail to see how it is relevant here!

         ri> You also misunderstood what I meant by having the same
         ri> collation sequence in both locales. We do not care whether a
         ri> hanzi is simplified or traditional. We simply count every hanzi
         ri> in Unicode and rank them accordingly. That covers all
         ri> characters from GBxxxx and Big5. The localedef program will
         ri> pick up the suitable characters when fed a specific charmap.
         ri> That's how it works in the new design of the glibc 2.2 locale
         ri> subsystem. So if two characters exist in both locales, they
         ri> will have the same relative order, and by definition collation
         ri> is only about relative order. In this sense we have the same
         ri> collation sequence.

   ri> As for iso14651, it's an international standard, part of what we
   ri> collectively call i18n standards. Our country cast a YES vote for
   ri> it last year. The template table certainly needs tailoring, but I
   ri> would not say it's useless.

      ha> That is a good point. Since our government voted YES, it might as
      ha> well become a mandatory standard some day. That, I don't know how
      ha> to deal with. I still like the pinyin method, but what are the
      ha> chances that iso14651 is adopted as a national standard?

      ha> Now I just wish there were some way to switch the collation in
      ha> use on the fly. Then we could have multiple collations for a
      ha> single locale.

         ri> No need to lose sleep over this. iso14651 does not enforce
         ri> any specific collation sequence.

   ri> I would like to hear more of your opinions, and maybe I should bring
   ri> this to a wider discussion to hear what other people say. Do you
   ri> mind if I post our exchange to some Chinese Linux development
   ri> lists, such as the debian-chinese mailing list (I know you are an
   ri> active member of that list)?

      ha> That is a good idea. Programs are written for people. More
      ha> discussion certainly gives a better chance of leading to better
      ha> solutions. Go for it.





