
Re: gb <==> big5 conversion module



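: Anthony Fok said that some of the characters I listed last time fall
: within the big5+ range. :)
: Some gb2312 characters have several possible big5 forms; this can only
: be solved by converting word by word. I am now writing a word
: segmentation program and it is going fairly well. I found some relevant
: papers and have already written a prototype.
: What is missing now is a gb2312<->big5 phrase mapping table. For the
: gb2312 segmentation dictionary I currently use the phrases bundled with
: unicon-im; a big5 dictionary should be available in xcin. However, these
: dictionaries carry no part-of-speech information :(, so I will just
: have to make do. For now I do not plan to call iconv from autoconvert,
: because not all platforms use glibc. :)
: It would still be better to unify the character tables, heh. I will
: wait for your results.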
: 
:                                         Yu Guanghui

Not exactly. iconv is a standard facility on all kinds of modern
UNIX systems, including FreeBSD, HP-UX, Solaris, etc. Most of them
can convert between different character sets and utf-8. However,
it is not guaranteed that they can convert between big5 and gb2312.
If they can't, it should be treated as a bug.

But from your post, you are working on a project whose functions go
beyond iconv :-)) Yes, you are right: for conversion specific to
traditional and simplified Chinese, we should write a special program
to handle the complex conversion rather than leave it to iconv. This
includes the character set mapping, Tsi (phrase) mapping, etc.
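
To make the phrase (Tsi) mapping idea concrete, here is a minimal sketch
in Python. It is not taken from autoconvert or any existing converter.
It assumes a forward longest-match lookup, and the names PHRASES, CHARS
and convert(), as well as the few table entries, are hypothetical.

```python
# Hypothetical phrase table (simplified -> traditional); entries are illustrative.
PHRASES = {
    "头发": "頭髮",
    "发展": "發展",
}
# Hypothetical single-character fallback, used when no phrase matches.
CHARS = {"头": "頭", "发": "發"}
MAX_LEN = max(len(p) for p in PHRASES)


def convert(text: str) -> str:
    """Convert text phrase by phrase, falling back to single characters."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched here: map one character and move on.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

With these tables, "头发" and "发展" come out with different traditional
forms of the same simplified character, which a character-only table
cannot do. A phrase missing from the table still falls back to the
per-character default.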

But in any case, I think we should also have a reliable iconv
that can at least do the simple mapping between big5 and gb2312.
Although many gb2312 characters can each map to several big5
characters, that does not matter. We just need a simple, commonly
available interface to do that. At the very least we should not run
into unconvertible characters (which in fact should be convertible),
as happens now. I think this is the goal of implementing the iconv
module for big5 <==> gb2312.
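
As a rough illustration of such a simple mapping, the Python sketch
below converts one character at a time and always takes the first
candidate. It is only an assumption about how a one-to-many table might
be handled, not the planned gconv module. The names CANDIDATES and
gb_to_big5_simple() and the table entries are hypothetical.

```python
# Hypothetical one-to-many table: simplified character -> traditional candidates.
# The first candidate is treated as the default; a real table would be far larger.
CANDIDATES = {
    "头": ["頭"],
    "发": ["發", "髮"],
    "后": ["後", "后"],
}


def gb_to_big5_simple(data: bytes) -> bytes:
    """Map GB2312 bytes to Big5 bytes one character at a time."""
    text = data.decode("gb2312")
    # Characters missing from the table are passed through unchanged.
    mapped = "".join(CANDIDATES.get(ch, [ch])[0] for ch in text)
    # Anything Big5 cannot represent becomes "?" instead of aborting the run.
    return mapped.encode("big5", errors="replace")
```

The default choice is sometimes wrong for a given word, but every
character in the table converts to something, which is the point of
the simple interface.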

So, if gb2312 contains several characters that are only available
in big5+, I propose that these characters be ignored, unless
in the future we want glibc to support big5+ :-))

: > Before leaving for vacation, I was also working on writing a gb <==> big5
: > gconv module. The first part of my plan was to establish a "best" mapping
: > between gb and big5. I did not take any existing conversion table because
: > none of them documented how they got their conversions and I don't feel
: > comfortable with that. So I rolled my own and took this opportunity to
: > check a few popular gb <==> big5 converters. Most of this work has been
: > finished. All the gb -> big5 conversions have been checked, but there are
: > some big5 -> gb conversions left. The result so far looks good. Compared
: > with the table of 130+ unmapped gb codes posted by Yu Guanghui a while
: > ago, 35 of them are mapped in my table. There are 4 codes not mapped in
: > my table, but mapped in autoconvert. However, I suspect that autoconvert
: > made a mistake in all 4 cases.
: > I'll write a more detailed post describing my methodology, conversion
: > table and the comparison results in the next few days. Then I'd like to
: > hear from you. If we all agree upon it, it's fairly easy to write the
: > module. Hopefully it will be in time for the 2.2.1 release, which is said
: > to be soon.

Thanks very much for your work :-)) I would be glad to help you in
development and testing. :-))


T.H.Hsieh


