
Re: gb <==> big5 conversion module



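: Anthony Fok said that some of the characters I listed last time fall
: within the big5+ range. :)
: Some gb2312 characters can be written several ways in big5; this can
: only be resolved by converting word by word. I am now working on a word
: segmentation program and it is going fairly well: I found some relevant
: papers and have already written a prototype.
: What is missing now is a gb2312<->big5 phrase correspondence table. For
: the gb2312 segmentation dictionary I am currently using the phrases
: shipped with unicon-im; a big5 dictionary should be available in xcin.
: However, these dictionaries carry no part-of-speech information :(, so I
: will just have to make do. For now I do not plan to call iconv from
: autoconvert, because not all platforms use glibc. :)
: It would still be better to unify the character tables, heh. I'll wait
: for your results.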
: 
:                                         Yu Guanghui

Not exactly. iconv is a standard facility on all kinds of modern
UNIX systems, including FreeBSD, HP-UX, Solaris, etc., and most
of them can convert between different character sets and utf-8.
However, it is not guaranteed that they can convert between big5
and gb2312. If they can't, it should be treated as a bug.
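
As a rough illustration of what a plain character-set conversion does, here
is a minimal sketch in Python. It uses Python's built-in "big5" and "gb2312"
codecs rather than iconv itself (an assumption made only so the example is
self-contained), and the function name big5_to_gb2312 is hypothetical. It
pivots through Unicode and collects every character that has no gb2312 code.

```python
def big5_to_gb2312(data: bytes, placeholder: str = "?"):
    """Convert big5 bytes to gb2312 bytes by pivoting through Unicode.

    Returns the converted bytes plus the list of characters that could
    not be encoded in gb2312. This is a charset conversion only: it does
    not turn traditional characters into simplified ones.
    """
    text = data.decode("big5")          # big5 bytes -> Unicode string
    out = bytearray()
    failed = []
    for ch in text:
        try:
            out += ch.encode("gb2312")  # Unicode -> gb2312, one char at a time
        except UnicodeEncodeError:
            failed.append(ch)           # no gb2312 code for this character
            out += placeholder.encode("gb2312")
    return bytes(out), failed


# Characters shared by both sets convert cleanly.
converted, failed = big5_to_gb2312("中文".encode("big5"))
print(converted.decode("gb2312"), failed)

# A traditional-only character has no gb2312 code and is reported.
converted, failed = big5_to_gb2312("體".encode("big5"))
print(converted, failed)
```

The second call shows why a pure charset conversion is not enough on its own:
the traditional form exists in big5 but not in gb2312, so some mapping to the
simplified form is needed before encoding.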

But from your post, you are working on a project whose functions go
beyond iconv :-)) Yes, you are right: for conversion specific to
traditional and simplified Chinese, we should write a special program
to handle the complex cases rather than leave them to iconv. This
includes the character set mapping, Tsi (phrase) mapping, etc.
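
To make the phrase-mapping idea concrete, here is a minimal sketch of
simplified-to-traditional conversion that tries phrases first and falls back
to single characters. It is not the autoconvert algorithm: it uses greedy
forward longest-match in place of a real word segmenter, works on Unicode
strings rather than raw gb2312/big5 bytes, and the tiny PHRASES and CHARS
tables are hypothetical samples.

```python
# Hypothetical sample tables; a real converter would load much larger ones.
PHRASES = {"头发": "頭髮", "发现": "發現"}   # phrase-level overrides
CHARS = {"头": "頭", "发": "發", "现": "現"}  # default one-to-one mapping


def convert(text, phrases=PHRASES, chars=CHARS):
    """Convert text using the longest matching phrase at each position,
    falling back to the per-character table (or the character itself)."""
    max_len = max(map(len, phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, down to length 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + n]
            if piece in phrases:
                out.append(phrases[piece])
                i += n
                break
        else:
            # No phrase matched here: map a single character.
            ch = text[i]
            out.append(chars.get(ch, ch))
            i += 1
    return "".join(out)


print(convert("头发"))  # phrase table picks 髮 for 发
print(convert("发现"))  # phrase table picks 發 for 发
print(convert("发"))    # no phrase context: default character mapping
```

The same simplified character comes out differently depending on the phrase
it sits in, which is exactly what a character-only table cannot express.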

But in any case, I think we should also have a reliable iconv
that can at least do the simple mapping between big5 and gb2312.
Although many gb2312 characters can map to several big5 characters,
that does not matter. We just need a simple, commonly available
interface for it. At the very least we should not run into characters
that are reported as unconvertible when in fact they are convertible,
as happens now. I think this is the goal of implementing the iconv
module for big5 <==> gb2312.

So, if gb2312 contains several characters that are only available
in big5+, I propose that these characters be neglected. Unless
in the future we want glibc to support big5+ :-))

: > Before leaving for vacation, I was also working on writing a gb <==> big5
: > gconv module. The first part of my plan was to establish a "best" mapping
: > between gb and big5. I did not take any existing conversion table because
: > none of them documented how they got their conversions and I don't feel
: > comfortable with that. So I rolled my own and took this opportunity to check a
: > few popular gb <==> big5 converters. Most of this work has been finished.
: > All the gb -> big5 conversions have been checked, but there are some big5 -> gb
: > conversions left. The result so far looks good. Compared with the table of
: > 130+ unmapped gb codes posted by Yu Guanghui a while ago, 35 of them are
: > mapped in my table. There are 4 codes not mapped in my table, but mapped in
: > autoconvert. However, I suspect that autoconvert made a mistake in all 4 cases.
: > I'll write a more detailed post describing my methodology, conversion
: > table and the comparison results in the next few days. Then I'd like to hear
: > from you. If we all agree upon it, it's fairly easy to write the module.
: > Hopefully it will be in time for the 2.2.1 release, which is said to be soon.
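
The kind of table comparison described in the quoted text can be sketched as
plain set arithmetic. The function name compare_tables and all codes and
values below are hypothetical placeholders, not entries from either real
table.

```python
def compare_tables(unmapped_codes, table_a, table_b):
    """Report how two mapping tables cover a list of problem codes."""
    codes = set(unmapped_codes)
    in_a = codes & set(table_a)   # problem codes that table A maps
    in_b = codes & set(table_b)   # problem codes that table B maps
    return {
        "mapped_in_a": sorted(in_a),
        "only_in_b": sorted(in_b - in_a),
        # Codes both tables map, but to different targets.
        "disagree": sorted(c for c in in_a & in_b if table_a[c] != table_b[c]),
    }


# Placeholder data: "G*" stands for a gb code, "B*" for a big5 code.
unmapped = ["G1", "G2", "G3", "G4"]
my_table = {"G1": "B1", "G2": "B2"}
other_table = {"G2": "B9", "G3": "B3"}

print(compare_tables(unmapped, my_table, other_table))
```

Codes that land in "only_in_b" or "disagree" are the ones worth checking by
hand, since either table could be the one in error.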

Thanks very much for your work :-)) I would be glad to help you in
development and testing. :-))

T.H.Hsieh

-- 
| This message was re-posted from debian-chinese-big5@lists.debian.org
| and converted from big5 to gb2312 by an automatic gateway.


