
Re: Chinese big5 encoding and PO files



On Wed, Jan 29, 2003 at 11:14:56AM +0100, Peter Karlsson wrote:
> Denis Barbier:
> 
> > Err, ascii(7) tells me that 0x5C *is* a backslash.
> 
> Yes, but these documents aren't ASCII, so a 0x5C byte may or may not be
> a backslash there, depending on where it is located in the file.

Ok.

> > Could you please have a look at chinese/po/others.zh.po and tell me
> > what to do with Subscribe/Unsubscribe translations?
> 
> Nothing should need to be done: the 0x5C byte is the trail byte of the
> character, so a proper MBCS-aware string scanner will recognize that it
> is not a backslash character (unlike, for instance, in the
> "please respect the ad policy" string a bit further down, which *does*
> contain a backslash in the translation). Getting the string scanner to
> work properly requires configuring the locales properly.

The problem with current WML is that streams are sequences of bytes, not
characters; this is why 0x5C bytes have to be escaped.
I am preparing a character-oriented version, but there are major backward
compatibility problems.  Any single file must then contain only one
encoding, so some files under webwml have to be fixed.
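
To make the byte/character difference concrete, here is a minimal sketch
in Python (not WML code; the function names are hypothetical and the
sample string is only an illustration).  It counts backslashes once on
the raw Big5 bytes and once on the decoded characters:

```python
def count_backslashes_bytewise(data: bytes) -> int:
    """Count 0x5C bytes, as a byte-oriented scanner would see them."""
    return data.count(b"\\")


def count_backslashes_charwise(data: bytes, encoding: str = "big5") -> int:
    """Decode first, then count real backslash characters."""
    return data.decode(encoding).count("\\")


# Three Big5 characters whose trail byte is 0x5C, then one real backslash.
sample = "許功蓋\\".encode("big5")

if __name__ == "__main__":
    print("bytewise:", count_backslashes_bytewise(sample))
    print("charwise:", count_backslashes_charwise(sample))
```

The byte-oriented count also includes the trail bytes, which is exactly
why those bytes currently have to be escaped.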

> Big5 is a bit problematic since it allows non-highbit characters as
> trail bytes, similar to the problems with ISO 2022-JP. A stateful
> string scanner is required to handle it properly. LibC should work fine
> as long as the proper locale is available, and I am pretty sure that
> the gettext utilities will handle this properly.

Yes, gettext is safe.
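
For illustration only, a lead-byte-aware scan could be sketched as below.
This is not how gettext or libc implement it; the function name is
hypothetical, and the lead-byte range 0x81-0xFE is an assumption meant to
cover the common Big5 variants:

```python
from typing import List


def real_backslash_offsets(data: bytes) -> List[int]:
    """Return byte offsets of 0x5C bytes that are not Big5 trail bytes."""
    offsets = []
    i = 0
    while i < len(data):
        b = data[i]
        # Assumed Big5 lead-byte range: the next byte is a trail byte,
        # so skip both without inspecting the trail byte.
        if 0x81 <= b <= 0xFE and i + 1 < len(data):
            i += 2
            continue
        if b == 0x5C:
            offsets.append(i)
        i += 1
    return offsets


if __name__ == "__main__":
    # One Big5 character ending in 0x5C, followed by a real backslash.
    print(real_backslash_offsets("功\\".encode("big5")))
```

The scanner has to remember whether the previous byte was a lead byte;
looking at a 0x5C byte in isolation is not enough.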

Instead of escaping some problematic characters, a better solution could
be to perform encoding conversions (as with Japanese files) to a safe
encoding.  Is there anyone interested in testing this scheme?
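
A rough sketch of what I mean, in Python.  UTF-8 is only an assumption
here, chosen because every byte of a multi-byte UTF-8 sequence has the
high bit set; which target encoding to use is still open, and the
function name is hypothetical:

```python
def big5_to_utf8(data: bytes) -> bytes:
    """Re-encode Big5 bytes as UTF-8 (assumed 'safe' target encoding)."""
    return data.decode("big5").encode("utf-8")


if __name__ == "__main__":
    original = "許功蓋\\".encode("big5")
    converted = big5_to_utf8(original)
    # After conversion, a 0x5C byte can only be a real backslash.
    print("0x5C bytes before:", original.count(b"\\"))
    print("0x5C bytes after: ", converted.count(b"\\"))
```

After such a conversion a byte-oriented tool no longer needs any escaping
of trail bytes, and the output can be converted back to Big5 at the end.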

Denis


