
Re: On the ordering of Han characters in Chinese locales



On Mon, 11 Sep 2000, rigel wrote:

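> After much deliberation, I have finally decided to use radical +
> stroke-count ordering, for the following reasons:
> 
> 1. Han characters in Unicode are arranged by radical + stroke count, so
>    radical + stroke order is simply Unicode code point order.  The locale
>    definition files currently use Unicode for all code points, so there is
>    no need to list the 27,000 Han characters one by one; the file stays
>    concise and easy to maintain.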

This is the simplest solution.  However, does this mean all the characters
in CJK Extension A (U+3400 ...) come first, followed by the characters in
CJK Unified Ideographs (U+4E00 ...), and finally followed by CJK Extension
B (U+20000 ...)[1], no matter what radical or stroke count?  And there may
be CJK Extension C to deal with, if/when it comes out...

[1] Coming out very soon with Unicode 3.1 and ISO 10646-2:2001.
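
A minimal sketch of what pure code point ordering does across blocks.
The block end points in BLOCKS and the helper name block_of are
assumptions added for illustration; only the starting code points are
taken from the message above.

```python
# Python compares strings by code point, so sorted() mimics a
# locale that collates Han characters purely by Unicode value.

# (start, end, label); the end points are assumed block boundaries.
BLOCKS = [
    (0x3400, 0x4DBF, "CJK Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B"),
]

def block_of(ch):
    """Return the label of the block containing ch, or 'other'."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"

# First code point of each block, deliberately given out of order.
chars = ["\U00020000", "\u4e00", "\u3400"]
by_code_point = sorted(chars)

for ch in by_code_point:
    print(f"U+{ord(ch):04X}  {block_of(ch)}")
```

Every Extension A character sorts ahead of every character in the main
block, and every Extension B character sorts after both, whatever the
radical or stroke count.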


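> 2. In the official Unicode files, about 7000 Han characters have no
>    pinyin.  These are all rare characters or characters from Japan and
>    Korea; assigning pinyin to them ourselves would be a huge undertaking,
>    close to impossible.  This is the main reason I abandoned pinyin
>    ordering.  (If you know of a more complete and authoritative mapping
>    table, please let me know.)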

There are a lot of problems with assigning pronunciations (Pinyin or
otherwise) to every character.  Almost all of the Japanese, Korean, and
Vietnamese characters have no Chinese readings, and Chinese dialect
characters usually have no Pinyin readings.  There are also many
characters listed in the large dictionaries whose readings and/or
meanings no one knows.

e.g., how many people know the Pinyin reading for 囝 (子 inside 囗) is
jian?  (It means 'child' in Min 閩語.)

~7000 missing readings is also nothing compared to the 40,000+ missing
readings for the characters in CJK Extension B!  Who wants to fill them
in? :(
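
A minimal sketch of how a pinyin-based sort key has to treat
characters with no reading.  The table READINGS, the function
pinyin_key, and the fallback rule (unread characters go last, in code
point order) are assumptions for illustration; the readings themselves
are the ones given in this message, and 㐀 is simply left out of the
table to play the part of a character with no recorded reading.

```python
# Tiny hypothetical reading table; a real one would cover thousands
# of characters and still have gaps.
READINGS = {"囝": "jian", "她": "ta", "姐": "jie"}

def pinyin_key(ch):
    """Sort by reading when one exists, else after everything else."""
    reading = READINGS.get(ch)
    if reading is None:
        # No reading: place after all romanised entries, by code point.
        return (1, "", ord(ch))
    return (0, reading, ord(ch))

chars = ["她", "㐀", "囝", "姐"]
ordered = sorted(chars, key=pinyin_key)
print(ordered)
```

Whatever fallback is chosen, the unread characters end up in a second,
non-phonetic ordering, so the result is a hybrid rather than a true
pinyin sort.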


> ha>ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/ has a
> ha>Uni2Pinyin.gz file. But it is kinda old (1996).

There is a file in the Big5+ package from CMEX (http://www.cmex.org.tw/)
which gives readings (in Zhuyin Fuhao, but they can be converted) for most
of the 20,902 characters of Unicode 1.1.  I recall they used one
dictionary for most of them (I think Taiwan pronunciation standards), and
then another for a few others, and left the dialect and Japanese/Korean
ones blank.

 
>    ri>    was based on Unicode 1.1. I'm also reluctant to accept any ad hoc
>    ri>    mapping tables, and prefer those from international or
>    ri>    national standard bodies, or credible research institutes.

I do not trust just any source, either.  Many, such as the UNIHAN.TXT
file, do not document where their information came from, or even whether
they are a composite of multiple sources!

 
>       ha> Well, I would guess that most hanzi that had multiple pronunciations
>       ha> are frequently used ones. Frequently used hanzi are sorted in GB2312
>       ha> according to pinyin. Pinyin for hanzi with multiple pronunciations is
>       ha> decided by the most frequently used pronunciation for the hanzi. I
>       ha> find this solution clean and simple. So a map to GB2312 can
>       ha> be used when there is ambiguity.
> 	 ri> Another problem with the existence of multi-pronunciation is that
> 	 ri> programmers cannot reliably depend on collation-based
> 	 ri> sorting, because one cannot assume which pronunciation the user
> 	 ri> intended. Going with the most frequent pronunciation is not a
> 	 ri> solution, because sometimes user might indeed look for a rarely
> 	 ri> used pronunciation.

Actually, from what I've seen in the _Hanyu Da Zidian_ 漢語大字典, a lot
of characters, including infrequently used ones, have multiple readings.
Many frequently used characters also have readings that most people don't
know about.  e.g., 她 is usually ta, but can also be chi (used in
girls' names) or jie (same as 姐).
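
A minimal sketch of the difference between sorting on one primary
reading and indexing a character under every reading.  The names
READINGS, primary_key, and build_index are hypothetical, and treating
the first listed reading as the most common one is an assumption; the
readings are the ones listed above.

```python
from collections import defaultdict

# Readings as listed in this message; first entry = usual reading.
READINGS = {"她": ["ta", "chi", "jie"], "姐": ["jie"]}

def primary_key(ch):
    """Collation key that only knows the most common reading."""
    return READINGS[ch][0]

def build_index(readings):
    """Map every reading to all characters that can carry it."""
    index = defaultdict(list)
    for ch, all_readings in readings.items():
        for reading in all_readings:
            index[reading].append(ch)
    return dict(index)

index = build_index(READINGS)
print(primary_key("她"))
print(index["jie"])
```

A collation order can place each character only once, so someone
looking for 她 under jie or chi will not find it there; a lookup index
can list it under all three, but that is a search aid, not a sort
order.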

 
>       ha> That the stroke-count order has a long history of acceptance in
>       ha> China is related to there having been no pronunciation standard and
>       ha> no symbolic system to represent pronunciation. (Okay, I don't know
>       ha> much about it. Just my personal impression.) Even since we have a
>       ha> standard pinyin

Well, there are always the pronunciation-based orders in rhyme books like
the _Guangyun_ 廣韻, but very few people would be able to use such a
sorting order--maybe your literature professor. :)  Plus it doesn't have
every character out there...

Radical and stroke count has the advantage of only requiring you to see
the character.  (There are disagreements over which radical and how many
strokes sometimes, but relatively minor.)
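
A minimal sketch of an explicit radical + stroke sort key, compared
with plain code point order.  The table RADICAL_STROKES and the
function rs_key are hypothetical; the (Kangxi radical number, residual
stroke count) pairs were entered by hand for illustration and should
be checked against an authoritative source before any real use.

```python
# (radical number, strokes beyond the radical), hand-entered.
RADICAL_STROKES = {
    "一": (1, 0),
    "㐀": (1, 4),   # an Extension A character
    "囝": (31, 3),
    "她": (38, 3),
    "姐": (38, 5),
    "子": (39, 0),
}

def rs_key(ch):
    """Radical first, then residual strokes, then code point as tiebreak."""
    radical, strokes = RADICAL_STROKES[ch]
    return (radical, strokes, ord(ch))

chars = ["姐", "㐀", "子", "一", "她", "囝"]
by_radical_stroke = sorted(chars, key=rs_key)
by_code_point = sorted(chars)

print(by_radical_stroke)
print(by_code_point)
```

Within the main block the two orders agree, but the Extension A
character moves: code point order puts it first, while the explicit
key files it under its radical after 一.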

 
Thomas Chan
tc31@cornell.edu





