Re: Release announcement simplified Chinese translation update

Deng Xiyue <manphiz-guest@users.alioth.debian.org> writes:

> Anthony Wong <ypwong@gmail.com> writes:
>> 2009/2/17 Arne Goetje <arne@linux.org.tw>
>>> Matt Kraai wrote:
>>> > On Mon, Feb 16, 2009 at 11:58:20PM +0800, Arne Goetje wrote:
>>> >> Matt Kraai wrote:
>>> >>> We already appear to use a single source version for all three Chinese
>>> >>> translations: Big5.  Whether it's possible to change to UTF-8 is for
>>> >>> someone more familiar with Chinese to say.  It's not sufficient to
>>> >>> just switch the encoding of this file, though:
>>> >>>
>>> >>>  $ make
>>> >>>  cd . && wml -q -D CUR_YEAR=2009 -o
>>> >>>  UNDEFuZH@uCNuCNHKuCNTW:20090214.zh-cn.html.tmp@g+w -o
>>> >>>  UNDEFuZH@uHKuCNHKuHKTW:20090214.zh-hk.html.tmp@g+w -o
>>> >>>  UNDEFuZH@uTWuCNTWuHKTW:20090214.zh-tw.html.tmp@g+w
>>> >>>  --prolog=../../bin/fix_big5.pl  20090214.wml
>>> >>>   * Converting: [zh_CN.GB2312], /usr/bin/iconv: illegal input sequence at position 233
>>> >>>  make: *** [20090214.zh-cn.html] Error 1
>>> >>>
>>> >> Doesn't surprise me. A number of characters which are present in Big5
>>> >> are not present in GB2312 (and vice versa). Using iconv to convert those
>>> >> characters will lead to such errors.
>>> >>
>>> >> zh-autoconvert might give better results.
>>> >>
>>> >> Else, if you can give me the link to the source, then I can take a look.
>>> >
>>> > Sure, it's available in the webwml CVS module at
>>> > chinese/News/2009/20090214.wml.  You can find instructions for
>>> > accessing the repository at
>>> >
>>> >  http://www.debian.org/devel/website/using_cvs
>>> >
>>> OK, attached are the results for review.
>>>
>>> Build-Depends: zh-autoconvert
>>>
>>> To convert from Big5 into GB2312:
>>>        autob5 -o gb < 20090214.wml > 20090214_gb2312.wml
>>> To convert from Big5 into UTF-8:
>>>        autob5 -o utf8 < 20090214.wml > 20090214_zht_utf8.wml
>>> To convert from Big5 into simplified Chinese UTF-8:
>>>        autob5 -o gb < 20090214.wml | autogb -o utf8 > 20090214_zhs_utf8.wml
>>>
>>> I used the latter two commands to generate the attached files.
>>>
>>> The difference between iconv and zh-autoconvert is that iconv simply
>>> tries to convert the codepoints one to one and zh-autoconvert uses a
>>> dictionary to map traditional characters to their simplified
>>> counterparts. Since the database is quite old, it may not work for
>>> simplified <-> traditional mappings where simplified characters have
>>> been added later (GBK) or where the document contains HKSCS characters,
>>> which use the Big5 Private Use Area. Those characters cannot be converted.
>>>
>>> I have long wanted to create a new library where a full Unicode
>>> compatible mapping takes place. Unfortunately I don't have the time for
>>> that. But if there are any volunteers out there, I'm willing to
>>> coordinate such a project.
>>>
>>> Cheers
>>> Arne
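
The difference Arne describes between a code-point conversion and a
dictionary conversion can be sketched roughly as below. This is only an
illustration: the three-entry table and both function names are made up
for the example, and it says nothing about how zh-autoconvert is really
implemented or how large its database is.

```python
# Hypothetical, tiny traditional -> simplified table (illustration only).
TRAD_TO_SIMP = {"檔": "档", "軟": "软", "體": "体"}


def codepoint_convert(big5_bytes):
    """Re-encode Big5 as GB2312 one code point at a time, the way a plain
    charset converter does. Fails if a character has no GB2312 code."""
    return big5_bytes.decode("big5").encode("gb2312")


def dictionary_convert(big5_bytes, table=TRAD_TO_SIMP):
    """Map traditional characters to simplified ones first, then encode."""
    text = big5_bytes.decode("big5")
    simplified = "".join(table.get(ch, ch) for ch in text)
    return simplified.encode("gb2312")


source = "檔案".encode("big5")

try:
    codepoint_convert(source)
    direct_ok = True
except UnicodeEncodeError:
    # 檔 exists in Big5 but has no code in GB2312.
    direct_ok = False

converted = dictionary_convert(source).decode("gb2312")
```

With this input the direct re-encoding fails on 檔, while the dictionary
route yields 档案. Characters missing from the table would still fail,
which matches the limitation mentioned for the old database.
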
>> Hi all,
>>
>> I have been thinking that using Big5 as the primary encoding for both
>> the TC (Traditional Chinese) and SC (Simplified Chinese) versions of
>> the Debian website is detrimental to user contributions. To summarize
>> the current situation of the Chinese versions of the Debian website:
>> translations must be done in Big5 WML files. The TC version is
>> basically converted straight from WML to HTML, but to generate the SC
>> versions, the Big5 files must first be converted to GB2312. It is done
>> this way because of the one-to-many SC-TC mapping problem. To deal
>> with terms that differ between TC and SC for the same meaning, like
>> 文件 and 檔案, we use a simple mapping table written in Perl, and for
>> some rarely used terms we use inline WML substitution syntax, like
>> [CN:文件:][HKTW:檔案:].
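
To illustrate what the inline syntax [CN:文件:][HKTW:檔案:] selects for
each output, here is a rough sketch. It is not how WML itself processes
these markers; the function name resolve_slices is hypothetical, and the
code assumes that a combined name such as HKTW is simply a run of
two-letter region codes (CN, HK, TW), as the slice names in the make
output earlier in this thread suggest.

```python
import re

# Matches [NAME:body:] where NAME is a run of capital letters.
SLICE = re.compile(r"\[([A-Z]+):(.*?):\]", re.DOTALL)


def resolve_slices(text, region):
    """Keep the body of every marker whose name includes `region`,
    and drop the markers meant for other regions."""
    def pick(match):
        name, body = match.group(1), match.group(2)
        # Assumption: names are concatenated two-letter region codes.
        regions = {name[i:i + 2] for i in range(0, len(name), 2)}
        return body if region in regions else ""
    return SLICE.sub(pick, text)


line = "請下載[CN:文件:][HKTW:檔案:]"
for_cn = resolve_slices(line, "CN")
for_tw = resolve_slices(line, "TW")
```

Here for_cn keeps 文件 and for_tw keeps 檔案; text outside the markers
is left alone.
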
>> This puts a hurdle in the way of SC users who want to submit
>> translations to Debian, because they write in SC but then have to use
>> some method or other to convert it to Big5 for submission. There is
>> also the possibility that the converted Big5 file may not contain
>> proper TC words/phrases. It also gives people the impression that SC
>> contributors are treated like "second-class citizens" (am I too
>> sensitive?).  Not to mention that Big5 and GB2312 are both considered
>> outdated encodings now and would be better replaced by UTF-8, to
>> make the same file accessible to both TC and SC users.
>>
>> I suggest 1. converting all existing Chinese WML files for the Debian
>> website from Big5 to UTF-8, and 2. using MediaWiki's Chinese conversion
>> table to do both TC-SC and SC-TC conversions. This way, we no longer
>> need to care which script the translators use, and the burden of using
>> Big5 is lifted from them.
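
Step 1 of the suggestion, re-encoding a file from Big5 to UTF-8, could
look roughly like the sketch below. The function name and the sample
file are made up for the example, and it assumes the file is plain Big5.
Files containing HKSCS characters would probably need a different codec
(Python ships one called "big5hkscs"), and any Big5-specific processing
in the build would have to be adjusted separately.

```python
import tempfile
from pathlib import Path


def reencode_big5_to_utf8(path):
    """Rewrite one file in place, from Big5 to UTF-8.

    Decoding happens before anything is written, so a file that is not
    valid Big5 raises UnicodeDecodeError and is left untouched.
    """
    path = Path(path)
    text = path.read_bytes().decode("big5")
    path.write_bytes(text.encode("utf-8"))


# Small self-contained demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.wml"          # hypothetical file name
    sample.write_bytes("檔案".encode("big5"))
    reencode_big5_to_utf8(sample)
    roundtrip = sample.read_bytes().decode("utf-8")
```
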
>> For MediaWiki's Chinese conversion system, please see:
>>  1. http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
>>  2. http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/ZhConversion.php?revision=47314&view=markup
>>  3. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hant
>>  4. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hans
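
A table-driven SC-TC conversion of the kind proposed can be sketched as
a greedy longest-match replacement, so that whole words such as 文件
win over single characters. This is only an illustration of the general
technique: the table below is a made-up handful of entries, not data
from MediaWiki's conversion tables, and the code is not taken from
ZhConversion.php.

```python
# Hypothetical SC -> TC table mixing word-level and character-level entries.
SC_TO_TC = {
    "文件": "檔案",
    "软件": "軟體",
    "软": "軟",
    "这": "這",
    "个": "個",
}


def convert(text, table):
    """Replace the longest table entry found at each position;
    copy characters that match no entry unchanged."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No entry starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


result = convert("这个软件的文件", SC_TO_TC)
```

Because 软件 is tried before 软, the word-level entry is used and the
result is 這個軟體的檔案 rather than a character-by-character rendering.
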
>> Any comments?
>> --
>> Anthony
> It would be great to migrate to UTF-8 to ease encoding conversion.
> However, I'm a little concerned about MediaWiki's solution for automatic
> dialect handling, which is complicated and possibly error-prone.
> It would be good if the inline diversion solution currently in use could
> be retained.  Plus, several diversions that are synonyms could be
> unified, like the example given above.  Ideas?

For example, I don't mean Anthony Wong's [CN:文件:][HKTW:檔案:], which
does require differentiation; I mean this one:


At least in mainland China, both versions are used, so I guess it
can be unified to "力盡所能" :)

Deng Xiyue