On the ordering of hanzi in Chinese locales
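Hello everyone, I am the author of the Chinese locale zh_CN in glibc 2.2. I am
currently preparing a revision and would like to ask for your thoughts on hanzi
collation. I know this is not a Debian-specific issue, but locales are the
foundation of i18n and l10n and have a large effect on debian-chinese, and this
list brings together Linux experts from mainland China, Taiwan, Hong Kong, and
overseas, so I am eager to hear your opinions. I hope nobody minds this slightly
off-topic post.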
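First, a very brief introduction to glibc 2.2, which is still in beta. The most
notable internationalization changes are the rewritten locale system and the
new widechar I/O functions. The new locale system is based on ISO 14652, the
localedef program supports multibyte characters, and locale definition files
are independent of the charmap: there is only one zh_CN, and separate
definition files such as zh_CN.GB2312 and zh_CN.GBK are no longer needed.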
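Collate is one of the locale categories; it specifies the order of characters
when strings are compared. What we care about, of course, is the order of
hanzi. The usual choices are radical + stroke-count order or pinyin order.
Earlier Chinese locale definitions, including hashao's zh_CN.GB2312, Chen
Xiangyang's (陳向陽) zh_CN.GBK, and Tung-Han Hsieh's (謝東翰) zh_TW.Big5, all
simply sorted by encoding. Because older glibc versions had limited support for
Chinese, collate was basically unused. Now that we finally have a glibc with
nearly complete i18n support, it may be time to consider hanzi collation
seriously.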
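As a rough illustration of what the collate category affects, here is a
minimal C sketch that sorts an array of strings with strcoll under a chosen
LC_COLLATE locale. The helper name sort_with_locale is hypothetical (it is not
part of glibc), and the sketch assumes the requested locale is installed on the
system; in the "C" locale strcoll orders strings the same way strcmp does.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: compare two C strings using the current LC_COLLATE rules. */
static int cmp_collate(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcoll(sa, sb);
}

/*
 * Hypothetical helper (not a glibc function): select the collation rules of
 * `locale_name` and sort `items` in place.
 * Returns 0 on success, -1 if the locale cannot be selected.
 */
int sort_with_locale(const char **items, size_t n, const char *locale_name)
{
    assert(items != NULL || n == 0);
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;              /* locale not installed or name not valid */
    if (n > 0)
        qsort(items, n, sizeof items[0], cmp_collate);
    return 0;
}
```

Calling sort_with_locale(list, n, "zh_CN") would then order hanzi by whatever
sequence the installed zh_CN definition specifies, which is exactly the choice
under discussion here.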
After much deliberation, I eventually decided on radical + stroke-count order,
for the following reasons:
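1. Hanzi in Unicode are arranged in radical + stroke-count order, so that order
is simply Unicode code order. Code points in the locale definition files are
now all written in Unicode, so the 27000 hanzi need not be listed one by one,
and the file stays concise and easy to maintain.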
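2. In the official Unicode files, about 7000 hanzi have no pinyin. These are
all rare characters or characters that come from Japan and Korea. Assigning
pinyin to them ourselves would be an enormous, nearly impossible undertaking.
This is the main reason I gave up on pinyin order. (If you know of a more
complete and authoritative mapping table, please tell me.)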
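3. Because many hanzi have more than one reading, I doubt whether pinyin order
is meaningful. Unlike a dictionary, where a character can appear on several
pages, here a character can have only one position.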
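4. Chinese currently has three locales, zh_CN, zh_TW, and zh_HK; given regional
cultural differences this is unavoidable. But I personally feel that wherever
they can be the same, we should try hard not to make them differ. Radical +
stroke-count order should be acceptable to everyone, and it gives us a common
collation. (I am glad to see that Tung-Han Hsieh has adopted the same order in
the new zh_TW.)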
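5. A stroke-based collate does not mean a program cannot sort strings by pinyin
internally. Programmers can do anything! Only the strcoll function uses
collate. You may never have used strcoll, but you have certainly used strcmp,
and think about whether this is how you use it in the vast majority of cases: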
    if ( strcmp(str1, str2) == 0 )
        ...
    else
        ...
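We only care whether the two strings are equal; we rarely need to put them in
order. In this sense European languages need collate more, because otherwise
there is no way to know, for example, that "ll" and "l" are equal (Spanish). In
Chinese all distinct hanzi are unequal, so whatever order is chosen matters
little. I do not think any particular order would especially help programming.
End users do not see collate, so they will not care either.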
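The point in item 5 can be sketched in C under some loudly stated assumptions:
the struct entry type, the sort_by_pinyin and same_string helpers, and the idea
that the application carries its own pinyin key for each string are all
hypothetical, made up for illustration. The equality test uses strcmp and never
consults collate, and the pinyin sort uses only the application's own keys, so
neither depends on which collation sequence the locale defines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical application-side record: the program, not the locale,
 * supplies the pinyin reading it wants to sort by. */
struct entry {
    const char *hanzi;   /* text in the program's encoding (UTF-8 here) */
    const char *pinyin;  /* reading chosen by the application */
};

/* qsort callback: order by pinyin key, then by raw bytes as a tiebreak. */
static int cmp_pinyin(const void *a, const void *b)
{
    const struct entry *ea = a;
    const struct entry *eb = b;
    int r = strcmp(ea->pinyin, eb->pinyin);
    return r != 0 ? r : strcmp(ea->hanzi, eb->hanzi);
}

/* Sort entries by the application's pinyin keys; LC_COLLATE is not used. */
void sort_by_pinyin(struct entry *v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n > 0)
        qsort(v, n, sizeof v[0], cmp_pinyin);
}

/* The common case from the post: only equality matters, so strcmp suffices. */
int same_string(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

For example, entries for 中 (zhong), 文 (wen), 字 (zi), and 排 (pai) would come
out in the order pai, wen, zhong, zi regardless of the locale's collate
definition. Where the pinyin keys come from, and which reading is chosen for a
character with several readings, is left entirely to the application.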
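If you think pinyin order has some advantage for programming with simplified
hanzi, such as code that is easier to write and maintain or a significant
performance gain, I would be very happy to see an example.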
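This has grown too long, so I will stop here and hope to hear your views. I
also hope you will help test glibc 2.2 and the Chinese locales. Does Debian
have a beta/alpha/whatever release that includes glibc 2.2?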
Finally, I attach my recent email exchange with hashao on this topic. Thanks to
hashao for the discussion, which made me think hard about many issues.
Rigel
ha> Regarding the zh_CN definition, could you make the hanzi part of
ha>the collate follow the pinyin order? Both ja_JP and ko_KR use their own
ha>collate instead of the iso14651_t1. The iso14651_t1 is pretty useless
ha>for hanzi.
ha>
ha> A pinyin table and a script should do the job.
ha>
ha>ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/ has a
ha>Uni2Pinyin.gz file. But it is kinda old (1996).
ri> The hanzi collation sequence of glibc 2.2 zh_CN locale is in hanzi's
ri> stroke-count order. That's how CJK unified ideographs are arranged in
ri> Unicode, if I'm not mistaken. For a while, I was torn between using
ri> stroke-count order and pinyin order when developing the locale. Eventually
ri> the decision was made in favor of stroke-count order because of the
ri> following considerations:
ri> 1. The mapping between hanzi and pinyin is incomplete in Unicode 3.0.
ri> By that I mean not every hanzi has a pinyin associated with it.
ri> Specifically, none of the 6582 from newly added CJK unified
ri> ideographs extension A has pinyin. Even for the long existing CJK
ri> unified ideographs, 558 of them do not have pinyin either. The
ri> table (Uni2Pinyin) you mentioned does not help much here since it
ri> was based on Unicode 1.1. I'm also reluctant to accept any ad hoc
ri> mapping tables, and prefer those from international or
ri> national standards bodies, or credible research institutes.
ha> Fair enough. But I think we can make the pinyin table
ha> complete. We can always update the data when there are
ha> some official standards published.
ri> Do you realize how BIG the job is to assign pinyin to ~7000
ri> obscure hanzi? And what do you suggest we do about the collation
ri> BEFORE we have them complete?
ri> 2. A fair number of hanzi have multiple pronunciations. This certainly
ri> will invite ambiguity and debate as to which one should be picked as the
ri> primary pinyin.
ha> Well, I would guess that most hanzi that have multiple pronunciations
ha> are frequently used ones. Frequently used hanzi are sorted in GB2312
ha> according to pinyin. Pinyin for hanzi with multiple pronunciations is
ha> decided by the most frequently used pronunciation for the hanzi. I
ha> find this solution clean and simple. So a map to GB2312 can
ha> be used when there is ambiguity.
ri> Well, I see the difference here is that you are satisfied with
ri> "most", while I'm looking for a solution for "all". What do you
ri> suggest we do if there is one case that cannot be mapped to
ri> GB2312? What about 2 cases? 3 cases? ..., and I can guarantee you
ri> that there are plenty.
ri> Another problem with the existence of multiple pronunciations is that
ri> programmers cannot reliably depend on collation-based
ri> sorting, because one cannot assume which pronunciation the user
ri> intended. Going with the most frequent pronunciation is not a
ri> solution, because sometimes the user might indeed look for a rarely
ri> used pronunciation.
ri> 3. Arranging hanzi in stroke-count order is a very natural way and has
ri> a long history of acceptance in China. Since it also happens to
ri> coincide with Unicode code value order, we can rank hanzi simply by
ri> their Unicode value, which is another tradition started by ASCII. This
ri> allows us to write a concise locale definition file, instead of
ri> enumerating each and every code (note, that's more than 27000 of them).
ha> That the stroke-count order has a long history of acceptance in China is
ha> related to the fact that there was no pronunciation standard and no symbolic
ha> system to represent the pronunciation. (Okay, I don't know much about
ha> it. Just my personal impression.) Ever since we have had a standard pinyin
ha> system, we are more familiar with the pinyin ordering system. And it
ha> is natural for our language, I would say. Stroke-count surely looks
ha> natural for ideographs, but when we think about Chinese, we more likely
ha> think of how it sounds (well, except those wubi input freaks. hehe).
ha> That is why we adopt the pinyin index system in most Chinese dictionaries.
ha> Yes, there are dictionaries like ci2hai3 that use stroke-count as the main
ha> index system. But the everyday Joe won't use ci2hai3 that often.
ha>
ha> When I look at a sorted list of hanzi, I find that it is quicker
ha> to jump to a hanzi term by using its pronunciation than by using its
ha> stroke-count.
ri> Using whatever collation sequence does not prohibit sorting the
ri> hanzi strings using pinyin order. It's the programmer's choice and
ri> has nothing to do with collation.
ha> As for the ASCII analogy, I would think that is why there is a collation
ha> table in the locale definition. Applications should use locale collation
ha> rules instead of the code value order to sort things. Speed and
ha> size are not problems, as hardware today has no problem coping with
ha> it. (That reminds me of the 386 era, when decoding jpeg was such a cpu hungry
ha> task that speed was the only thing programs like qpeg, sea advertised.
ha> And people bought that! Nowadays, no one cares how fast you can decode a
ha> jpeg file.) If you were talking about the effort to make a Chinese
ha> collation table with 27000 entries, yes, it is much more work than just
ha> using the Unicode code value scheme.
ri> That is absolutely NOT true. Applications have every right to use
ri> whatever rules they see fit to do the sorting. No one forces you
ri> to use locale collation. As a matter of fact, among all the C
ri> library functions, only four use locale collation: strcoll,
ri> strxfrm, wcscoll, and wcsxfrm, and the latter two are simply the
ri> widechar counterparts of the first two.
ri> 4. Pinyin order might not make much sense to people from Taiwan, at
ri> least for now. Using two different collation sequences for hanzi in
ri> zh_CN and zh_TW is, in my opinion, a disaster. That will make
ri> programmers' lives miserable. Now that we have a chance to adopt the
ri> same one, why let it escape?!
ha> It is true that having both zh_CN and zh_TW is not very applaudable.
ha> However, we have to face the reality that there are differences in
ha> cultures here: different monetary notation, different calendar
ha> notation. For stroke-count, traditional hanzi and simplified hanzi have
ha> totally different and sometimes inconsistent stroke-counts. Simplified
ha> hanzi ordered by stroke-count do not make much sense to people from
ha> Taiwan anyway.
ha> I read this thread in the CLE developer's list (at cle.linux.org.tw)
ha> discussing separate sets for zh_CN and zh_TW between CLE developers
ha> and Ulrich Drepper. The conclusion was to keep two sets of locales
ha> (zh_CN and zh_TW). About making programmers' lives miserable, well, we
ha> can have zh_CN.Big5 as well as zh_TW.GB2312. Will it solve the
ha> problem?
ha> Most mainland users are used to looking up hanzi by pinyin, right?
ri> Yes, there are differences between zh_CN and zh_TW, though
ri> probably not as big as you thought. Merging them into one would
ri> be a crazy idea; I don't think anyone even hinted at it. You
ri> are mistaken about the CLE thread. I believe what you referred
ri> to is the one in which Ulrich suggested that a single libc PO
ri> file would be sufficient for both locales. That was obviously wrong.
ri> However, I fail to see how that is relevant here!
ri> You also misunderstood what I mean by having the same collation
ri> sequence in both locales. We do not care whether the hanzi is
ri> simplified or traditional. We simply count every hanzi in Unicode
ri> and rank them accordingly. That will cover all characters from
ri> GBxxxx and Big5. The localedef program will pick up the suitable
ri> characters when fed with a specific charmap. That's how it works
ri> in the new design of the glibc 2.2 locale subsystem. So if two
ri> characters exist in both locales, they will have the same
ri> relative order, and by definition collation is only about relative
ri> order. In this sense we have the same collation sequence.
ri> As for ISO 14651, it's an international standard, part of what we
ri> collectively call i18n standards. Our country cast a YES vote for
ri> it last year. The template table certainly needs tailoring, but I would
ri> not say that it's useless.
ha> That is a good point. Since our government voted YES, it might as well
ha> become a mandatory standard some day. That, I don't know how to deal
ha> with. I still like the pinyin method, but how much chance is there that
ha> ISO 14651 is adopted as a national standard?
ha> Now I just wish there were some way we could switch the collation used
ha> on the fly. Then we could have multiple sets of collations for a single
ha> locale.
ri> No need to lose your sleep over this. ISO 14651 does not enforce
ri> any specific collation sequence.
ri> I would like to hear more of your opinion, and maybe I should bring this
ri> to a wider discussion to hear what other people say about this. Do you
ri> mind if I post the exchange between us to some Chinese Linux development
ri> discussion lists, such as the debian-chinese mailing list (I know you are an
ri> active member of that list)?
ha> That is a good idea. Programs are written for people. More
ha> discussion certainly has a better chance of leading to better
ha> solutions. Go for it.
--
| This message was re-posted from debian-chinese-gb@lists.debian.org
| and converted from gb2312 to big5 by an automatic gateway.