Re: Release announcement simplified Chinese translation update
2009/2/17 Arne Goetje <email@example.com>
> Matt Kraai wrote:
> > On Mon, Feb 16, 2009 at 11:58:20PM +0800, Arne Goetje wrote:
> >> Matt Kraai wrote:
> >>> We already appear to use a single source version for all three Chinese
> >>> translations: Big5. Whether it's possible to change to UTF-8 is for
> >>> someone more familiar with Chinese to say. It's not sufficient to
> >>> just switch the encoding of this file, though:
> >>> $ make
> >>> cd . && wml -q -D CUR_YEAR=2009 -o UNDEFuZH@uCNuCNHKuCNTW:20090214.zh-cn.html.tmp@g+w -o UNDEFuZH@uHKuCNHKuHKTW:20090214.zh-hk.html.tmp@g+w -o UNDEFuZH@uTWuCNTWuHKTW:20090214.zh-tw.html.tmp@g+w --prolog=../../bin/fix_big5.pl 20090214.wml
> >>> * Converting: [zh_CN.GB2312], /usr/bin/iconv: illegal input sequence at position 233
> >>> make: *** [20090214.zh-cn.html] Error 1
> >> Doesn't surprise me. A number of characters which are present in Big5
> >> are not present in GB2312 (and vice versa). Using iconv to convert those
> >> characters will lead to such errors.
> >> zh-autoconvert might give better results.
> >> Else, if you can give me the link to the source, then I can take a look.
> > Sure, it's available in the webwml CVS module at
> > chinese/News/2009/20090214.wml. You can find instructions for
> > accessing the repository at
> > http://www.debian.org/devel/website/using_cvs
> OK, attached are the results for review.
> Build-Depends: zh-autoconvert
> To convert from Big5 into GB2312:
> autob5 -o gb < 20090214.wml > 20090214_gb2312.wml
> To convert from Big5 into UTF-8:
> autob5 -o utf8 < 20090214.wml > 20090214_zht_utf8.wml
> To convert from Big5 into simplified Chinese UTF-8:
> autob5 -o gb < 20090214.wml | autogb -o utf8 > 20090214_zhs_utf8.wml
> I used the latter two commands to generate the attached files.
> The difference between iconv and zh-autoconvert is that iconv simply
> tries to convert the codepoints one to one and zh-autoconvert uses a
> dictionary to map traditional characters to their simplified
> counterparts. Since the database is quite old, it may not work for
> simplified <-> traditional mappings where simplified characters have
> been added later (GBK) or where the document contains HKSCS characters,
> which use the Big5 Private Use Area. Those characters cannot be converted.
> I have long wanted to create a new library where a full Unicode
> compatible mapping takes place. Unfortunately I don't have the time for
> that. But if there are any volunteers out there, I'm willing to
> coordinate such a project.
I have been thinking that using Big5 as the primary encoding for both the TC (Traditional Chinese) and SC (Simplified Chinese) versions of the Debian website is detrimental to user contributions. To summarize the current situation of the Chinese versions of the Debian website: translations must be written in Big5 WML files. The TC versions are essentially converted directly from WML to HTML, but to generate the SC version, the Big5 files must first be converted to GB2312. It is done this way because of the one-to-many SC-TC mapping problem (one simplified character can correspond to several traditional characters, so TC to SC is the less ambiguous direction). To deal with terms that differ between TC and SC for the same meaning, like 文件 and 檔案, we use a simple mapping table written in Perl, and for terms that are rarely used we use inline WML substitution syntax, like [CN:文件:][HKTW:檔案:].
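To illustrate how the inline substitution syntax selects a term per output, here is a minimal Python sketch. It is not how WML itself works internally: the function name `resolve_slices` is made up for this example, the slice sets are copied from the -o options in the make output quoted above (ignoring the UNDEF and ZH@ parts), and nested slices are not handled.

```python
import re

# Slice names kept for each output file, taken from the -o options in the
# quoted make output. This table is an assumption for illustration only.
SLICES = {
    "zh-cn": {"CN", "CNHK", "CNTW"},
    "zh-hk": {"HK", "CNHK", "HKTW"},
    "zh-tw": {"TW", "CNTW", "HKTW"},
}

# Matches one non-nested slice such as [CN:文件:]
SLICE_RE = re.compile(r"\[([A-Z]+):(.*?):\]", re.S)


def resolve_slices(text, target):
    """Keep the body of slices selected for `target`, drop the others.

    Text outside any slice is passed through unchanged.
    """
    keep = SLICES[target]

    def pick(match):
        name, body = match.group(1), match.group(2)
        return body if name in keep else ""

    return SLICE_RE.sub(pick, text)


if __name__ == "__main__":
    source = "[CN:文件:][HKTW:檔案:]"
    for target in ("zh-cn", "zh-hk", "zh-tw"):
        print(target, resolve_slices(source, target))
```

With the example from above, the zh-cn output keeps 文件 while the zh-hk and zh-tw outputs keep 檔案.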
This creates a hurdle for SC users who want to submit translations to Debian: they write in SC but then have to convert their text to Big5, by whatever method they can find, before submitting it. There is also the possibility that the converted Big5 file does not contain proper TC words and phrases. It also gives people the impression that SC contributors are treated like "second-class citizens" (am I too sensitive?). Moreover, Big5 and GB2312 are both considered outdated encodings now and would be better replaced by UTF-8, which makes the same file accessible to both TC and SC users.
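To make the encoding point concrete, here is a small Python sketch that checks which encodings can represent the two example terms. It uses Python's built-in codecs rather than iconv, so it only approximates the failure in the quoted make output; the helper name `can_encode` is made up for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    tc_term = "檔案"  # traditional term
    sc_term = "文件"  # simplified term
    for enc in ("big5", "gb2312", "utf-8"):
        print(enc, "TC:", can_encode(tc_term, enc), "SC:", can_encode(sc_term, enc))
```

The traditional character 檔 has no code point in GB2312, so a plain re-encoding of that term fails, while UTF-8 can hold both terms in the same file.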
I suggest that we 1. convert all existing Chinese WML files for the Debian website from Big5 to UTF-8, and 2. use MediaWiki's Chinese conversion table to do both TC-SC and SC-TC conversions. This way, we no longer need to care which script the translators use, and they are relieved of the burden of using Big5.
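As a rough idea of what table-driven conversion on UTF-8 text could look like, here is a minimal Python sketch using greedy longest-match lookup, so that phrase entries win over single-character entries. The table below is a tiny hypothetical sample written for this example, not MediaWiki's actual data, and the function name `convert` is likewise an assumption.

```python
# Hypothetical sample entries for illustration only; a real table would be
# far larger and would be taken from an existing conversion system.
TC_TO_SC = {
    "檔案": "文件",  # phrase-level difference in terminology
    "軟體": "软件",  # phrase-level difference in terminology
    "檔": "档",      # single-character fallback
    "體": "体",      # single-character fallback
}


def convert(text, table):
    """Convert `text` by always trying the longest matching table key first."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate at position i, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No entry matched: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("檔案", TC_TO_SC))
    print(convert("存檔", TC_TO_SC))
```

An SC-TC table would work the same way in the other direction, although the one-to-many cases need phrase entries to pick the right traditional character.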
For MediaWiki's Chinese conversion system, please see: