On hanzi collation in the Chinese locales



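Hello everyone, I am the author of zh_CN, the Chinese locale in glibc 2.2. I
am preparing a revision and would like to ask for your thoughts on hanzi
collation. I know this is not a Debian-specific question, but locales are the
foundation of i18n and l10n and have a large impact on debian-chinese, and
this list gathers Linux experts from the mainland, Taiwan, Hong Kong and
overseas, so I am eager to hear your opinions. I hope nobody minds this
slightly off-topic post.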

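First, a very brief introduction to glibc 2.2, which is still in beta. The
most notable internationalization changes are a rewritten locale system and
the implementation of the widechar I/O functions. The new locale system is
based on ISO 14652. The localedef program supports multibyte characters, and
locale definition files are independent of the charmap, so there is only one
zh_CN; separate definition files such as zh_CN.GB2312 and zh_CN.GBK are no
longer needed.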

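Collate is one of the categories in a locale; it specifies the order of
characters when strings are compared. What we care about, of course, is the
order of hanzi. The usual choices are radical + stroke-count order or pinyin
order. Earlier Chinese locale definitions, including hashao's zh_CN.GB2312,
陳向陽's zh_CN.GBK and 謝東翰's zh_TW.Big5, all sorted directly by encoding.
Because older versions of glibc had limited support for Chinese, collate was
essentially unused. Now that we finally have a glibc with nearly complete
i18n support, it may be time to think seriously about hanzi collation.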

After much deliberation, I decided to use radical + stroke-count order, for
the following reasons:

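1. Hanzi in Unicode are arranged by radical + stroke count, so radical +
   stroke-count order is simply Unicode code order. Locale definition files
   now express all codes in Unicode, so there is no need to list the 27000
   hanzi one by one; the file stays concise and easy to maintain.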
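2. In the official Unicode files, about 7000 hanzi have no pinyin. They are
   all rare characters or characters that come from Japan and Korea.
   Assigning pinyin to them ourselves would be a huge undertaking, close to
   impossible. This is my main reason for giving up pinyin order. (If you
   know of a more complete and authoritative mapping table, please tell me.)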
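3. Because some hanzi have several pronunciations, I doubt that pinyin order
   is meaningful. A dictionary can list one character on several pages, but
   here a character can have only one position.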
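4. Chinese currently has three locales: zh_CN, zh_TW and zh_HK. Given the
   regional cultural differences, this is unavoidable. But I personally
   believe that wherever they can be the same, we should try hard not to make
   them differ. Radical + stroke-count order should be acceptable to
   everyone, and it gives us a common collation. (I am glad to see that
   謝東翰 has adopted the same order in the new zh_TW.)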
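5. Using a stroke-count collate does not mean that a program cannot sort
   strings by pinyin internally. Programmers can do anything! Only strcoll
   (and its relatives strxfrm, wcscoll and wcsxfrm) uses collate. You may
   never have used strcoll, but you have certainly used strcmp. Consider
   whether this is how you use it in the vast majority of cases:
                if (strcmp(str1, str2) == 0)
                   ...
                else
                   ...
   We only care whether two strings are identical and rarely need to put
   them in order. In this sense European languages need collate more,
   because otherwise there is no way to know, for example, that "ll" and "l"
   are equal (Spanish). All distinct hanzi are unequal, so whatever order is
   chosen matters little. I do not think any particular order is especially
   helpful for programming. End users do not see collate, so they will not
   care either.
   If you think pinyin order brings some benefit to programming with
   simplified hanzi, such as code that is easier to write and maintain or
   significantly better performance, I would be very glad to see an example.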
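
The distinction drawn in point 5 can be sketched in C. This is only a minimal
illustration, not code from glibc; the helper names same_string and
sort_by_locale are made up for the example. strcmp answers whether two strings
are byte-for-byte identical, while strcoll consults the LC_COLLATE category of
whatever locale the program has selected.

```c
#include <stdlib.h>
#include <string.h>

/* Equality test: compares bytes only and ignores the locale entirely. */
int same_string(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* qsort comparator that defers to the LC_COLLATE rules of the current locale. */
static int by_collation(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Sort an array of strings in the order defined by the current locale. */
void sort_by_locale(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], by_collation);
}
```

A caller would normally run setlocale(LC_COLLATE, "") once at startup so that
strcoll picks up the user's locale. Without that call the program stays in the
"C" locale, where strcoll orders strings the same way strcmp does.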

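This has grown too long, so I will stop here and hope to hear your views. I
also hope you will help test glibc 2.2 and the Chinese locales. Does Debian
have a beta/alpha/whatever release that includes glibc 2.2?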

Finally, I attach my recent email exchange with hashao on this topic. Thanks
to hashao for the discussion, which made me think seriously about many issues.

Rigel

ha>   Regarding the zh_CN definition, could you make the hanzi part of
ha>the collate follow the pinyin order? Both ja_JP and ko_KR use their own
ha>collate instead of the iso10651_lt. The iso10651_lt is pretty useless
ha>for hanzi.
ha>
ha>    A pinyin table and a script should do the job.
ha>
ha>ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/ has a
ha>Uni2Pinyin.gz file. But it is kinda old (1996).

   ri> The hanzi collation sequence of the glibc 2.2 zh_CN locale is in hanzi's
   ri> stroke-count order. That's how CJK unified ideographs are arranged in
   ri> Unicode, if I'm not mistaken. For a while, I was torn between using
   ri> stroke-count order and pinyin order when I developed the locale.
   ri> Eventually the decision was made in favor of stroke-count order because
   ri> of the following considerations:

   ri> 1. The mapping between hanzi and pinyin is incomplete in Unicode 3.0.
   ri>    By that I mean not every hanzi has a pinyin associated with it.
   ri>    Specifically, none of the 6582 newly added CJK unified
   ri>    ideographs in extension A has pinyin. Even among the long-existing
   ri>    CJK unified ideographs, 558 do not have pinyin either. The
   ri>    table (Uni2Pinyin) you mentioned does not help much here since it
   ri>    was based on Unicode 1.1. I'm also reluctant to accept any ad hoc
   ri>    mapping tables, and prefer those from international or
   ri>    national standards bodies, or credible research institutes.

      ha> Fair enough. But I think we can make the pinyin table
      ha> complete. We can always update the data when there are
      ha> some official standards published.

         ri> Do you realize how BIG a job it is to assign pinyin to ~7000
         ri> obscure hanzi? And what do you suggest we do about the collation
         ri> BEFORE we have them complete?

   ri> 2. A fair number of hanzi have multiple pronunciations. This will
   ri> certainly invite ambiguity and debate as to which one should be picked
   ri> as the primary pinyin.

      ha> Well, I would guess that most hanzi that have multiple pronunciations
      ha> are frequently used ones. Frequently used hanzi are sorted in GB2312
      ha> according to pinyin. The pinyin for a hanzi with multiple
      ha> pronunciations is decided by its most frequently used pronunciation.
      ha> I find this solution clean and simple. So a map to GB2312 can
      ha> be used when there is ambiguity.

         ri> Well, I see the difference here is that you are satisfied with
         ri> "most", while I'm looking for a solution for "all". What do you
         ri> suggest we do if there is one case that cannot be mapped to
         ri> GB2312? What about 2 cases? 3 cases? ..., and I can guarantee you
         ri> that there are plenty.

         ri> Another problem with the existence of multiple pronunciations is
         ri> that programmers cannot reliably depend on collation-based
         ri> sorting, because one cannot assume which pronunciation the user
         ri> intended. Going with the most frequent pronunciation is not a
         ri> solution, because sometimes the user might indeed look for a
         ri> rarely used pronunciation.

   ri> 3. Arranging hanzi in stroke-count order is very natural and has a
   ri> long history of acceptance in China. Since it also happens to
   ri> coincide with Unicode code value order, we can rank hanzi simply by
   ri> their Unicode value, which is another tradition started by ASCII. This
   ri> allows us to write a concise locale definition file, instead of
   ri> enumerating each and every code (note, that's more than 27000 of them).

      ha> That stroke-count order has a long history of acceptance in China is
      ha> related to the fact that there was no pronunciation standard and no
      ha> symbolic system to represent pronunciation. (Okay, I don't know much
      ha> about it. Just my personal impression.) Ever since we have had a
      ha> standard pinyin system, we have been more familiar with pinyin
      ha> ordering. And it is natural for our language, I would say.
      ha> Stroke count surely looks natural for ideographs, but when we think
      ha> about Chinese, we more likely think of how it sounds (well, except
      ha> those wubi input freaks. hehe). That is why most Chinese dictionaries
      ha> adopt a pinyin index system. Yes, there are dictionaries like ci2hai3
      ha> that use stroke count as the main index system. But the everyday Joe
      ha> won't use ci2hai3 that often.
      ha> 
      ha> When I look at a sorted list of hanzi, I find it quicker
      ha> to jump to a hanzi term by its pronunciation than by its
      ha> stroke count.

         ri> Whatever collation sequence is used, nothing prohibits sorting
         ri> hanzi strings in pinyin order. It's the programmer's choice and
         ri> has nothing to do with collation.
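
An application-level pinyin comparison of the kind ri> describes might look
like the sketch below. Everything in it is hypothetical: the three-entry
demo_table stands in for a real hanzi-to-pinyin table, and
compare_hanzi_by_pinyin is a name invented for the example. The fallback
branches also show the incompleteness problem raised earlier: a character
with no pinyin entry still has to be placed somewhere, and here it is ranked
after all characters that have one, by code value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pinyin_entry {
    uint32_t code;        /* Unicode code point of the hanzi */
    const char *pinyin;   /* toneless pinyin chosen for it (an assumption) */
};

/* Illustrative entries only; a real table would have to cover every hanzi. */
static const struct pinyin_entry demo_table[] = {
    { 0x4E2D, "zhong" },  /* 中 */
    { 0x5B57, "zi"    },  /* 字 */
    { 0x6587, "wen"   },  /* 文 */
};

/* Linear lookup; returns NULL when the character has no pinyin entry. */
static const char *lookup_pinyin(uint32_t code)
{
    for (size_t i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++)
        if (demo_table[i].code == code)
            return demo_table[i].pinyin;
    return NULL;
}

/* Order two hanzi by pinyin; characters without pinyin sort last,
 * and ties are broken by code value. Returns <0, 0 or >0. */
int compare_hanzi_by_pinyin(uint32_t a, uint32_t b)
{
    const char *pa = lookup_pinyin(a);
    const char *pb = lookup_pinyin(b);

    if (pa != NULL && pb != NULL) {
        int c = strcmp(pa, pb);
        if (c != 0)
            return c;
    } else if (pa != NULL) {
        return -1;
    } else if (pb != NULL) {
        return 1;
    }
    return (a > b) - (a < b);
}
```

Such a comparator works the same under any locale, because it never calls
strcoll or reads LC_COLLATE.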

      ha> As for the ASCII analogy, I would think that is why there are
      ha> collation tables in locale definitions. Applications should use the
      ha> locale collation rules instead of code value order to sort things.
      ha> Speed and size are not problems, as today's hardware has no trouble
      ha> coping with them. (That reminds me of the 386 era, when decoding jpeg
      ha> was such a cpu-hungry task that speed was the only thing programs
      ha> like qpeg and sea advertised. And people bought that! Nowadays, no
      ha> one cares how fast you can decode a jpeg file.) If you were talking
      ha> about the effort to make a Chinese collation table with 27000
      ha> entries, yes, it is much more work than just using the Unicode code
      ha> value scheme.
      
         ri> That is absolutely NOT true. Applications have every right to use
         ri> whatever rules they see fit to do the sorting. No one forces you
         ri> to use locale collation. As a matter of fact, among all the C
         ri> library functions, only four use locale collation: strcoll,
         ri> strxfrm, wcscoll, and wcsxfrm, and the latter two are simply the
         ri> widechar counterparts of the first two.
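
Of the four functions named above, strxfrm is probably the least familiar. A
small sketch of its usual role follows; make_collation_key is a hypothetical
helper name. strxfrm rewrites a string into a key such that comparing two
keys with strcmp gives the same ordering as comparing the original strings
with strcoll, which helps when the same strings are compared many times.

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated collation key for s under the current
 * LC_COLLATE locale, or NULL if allocation fails. The caller frees it. */
char *make_collation_key(const char *s)
{
    /* With n == 0 strxfrm only reports the key length (without the '\0'). */
    size_t needed = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(needed);

    if (key != NULL)
        strxfrm(key, s, needed);
    return key;
}
```

A program would typically build one key per string, sort the keys with plain
strcmp, and free them afterwards.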

   ri> 4. Pinyin order might not make much sense to people from Taiwan, at
   ri> least for now. Using two different collation sequences for hanzi in
   ri> zh_CN and zh_TW is, in my opinion, a disaster. That will make
   ri> programmers' lives miserable. Now that we have a chance to adopt the
   ri> same one, why let it escape?!

      ha> It is true that having two locales, zh_CN and zh_TW, is not ideal.
      ha> However, we have to face the reality that there are cultural
      ha> differences here: different monetary notation, different calendar
      ha> notation. As for stroke count, traditional hanzi and simplified hanzi
      ha> have totally different and sometimes inconsistent stroke counts.
      ha> Simplified hanzi ordered by stroke count do not make much sense to
      ha> people from Taiwan anyway.

      ha> I read a thread on the CLE developers' list (at cle.linux.org.tw)
      ha> discussing separate sets for zh_CN and zh_TW between CLE developers
      ha> and Ulrich Drepper. The conclusion was to keep two sets of locales
      ha> (zh_CN and zh_TW). About making programmers' lives miserable, well,
      ha> we can have zh_CN.Big5 as well as zh_TW.GB2312. Will that solve the
      ha> problem?

      ha> Most mainland users are used to looking up hanzi by pinyin, right?

         ri> Yes, there are differences between zh_CN and zh_TW, though
         ri> probably not as big as you think. Merging them into one would
         ri> be a crazy idea; I don't think anyone even hinted at it. You
         ri> are mistaken about the CLE thread. I believe what you referred
         ri> to is the one in which Ulrich suggested that a single libc PO
         ri> file would be sufficient for both locales. That was obviously
         ri> wrong. However, I fail to see how that is relevant here!

         ri> You also misunderstood what I mean by having the same collation
         ri> sequence in both locales. We do not care whether a hanzi is
         ri> simplified or traditional. We simply count every hanzi in Unicode
         ri> and rank them accordingly. That will cover all characters from
         ri> GBxxxx and Big5. The localedef program will pick up the suitable
         ri> characters when fed a specific charmap. That's how it works
         ri> in the new design of the glibc 2.2 locale subsystem. So if two
         ri> characters exist in both locales, they will have the same
         ri> relative order, and by definition collation is only about relative
         ri> order. In this sense we have the same collation sequence.
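
The "same relative order" argument can be illustrated with a short sketch. It
does not reproduce what localedef does internally; it only assumes that
characters are ranked by Unicode code value, as described above, and the
function names are invented for the example. Whatever subset of characters a
charmap provides, sorting that subset by code value keeps any two shared
characters in the same order relative to each other.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: rank characters purely by Unicode code value. */
static int by_code_value(const void *pa, const void *pb)
{
    uint32_t a = *(const uint32_t *)pa;
    uint32_t b = *(const uint32_t *)pb;
    return (a > b) - (a < b);
}

/* Sort an array of code points into code value order, in place. */
void sort_by_code_value(uint32_t *chars, size_t count)
{
    qsort(chars, count, sizeof chars[0], by_code_value);
}
```

For instance, U+4E2D sorts before U+6587 whether the array being sorted
holds only those two characters or many more.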

   ri> As for iso10651, it's an international standard, part of what we
   ri> collectively call the i18n standards. Our country cast a YES vote for
   ri> it last year. The template table certainly needs tailoring, but I would
   ri> not say that it's useless.

      ha> That is a good point. Since our government voted YES, it might well
      ha> become a mandatory standard some day. That, I don't know how to deal
      ha> with. I still like the pinyin method, but how likely is it that
      ha> iso10651 will be adopted as a national standard?

      ha> Now I just wish there were some way to switch the collation used
      ha> on the fly. Then we could have multiple sets of collations for a
      ha> single locale.

         ri> No need to lose any sleep over this. iso10651 does not enforce
         ri> any specific collation sequence.

   ri> I would like to hear more of your opinions, and maybe I should bring
   ri> this to a wider discussion to hear what other people say about it. Do
   ri> you mind if I post the exchange between us to some Chinese Linux
   ri> development discussion lists, such as the debian chinese mailing list
   ri> (I know you are an active member of that list)?

      ha> That is a good idea. Programs are written for people. More
      ha> discussion certainly has a better chance of leading to better
      ha> solutions. Go for it.


-- 
| This message was re-posted from debian-chinese-gb@lists.debian.org
| and converted from gb2312 to big5 by an automatic gateway.


