
Re: Translators! What's your charset?



At Wed, 11 Oct 2000 14:33:30 +1100,
Craig Small <csmall@eye-net.com.au> wrote:

> I cannot do the following ones until I get a charset:

> jp - iso-2022-jp
  `ja' if you mean the language code rather than the country code.

Anyway, iso-2022-jp is not so easy.
It uses US-ASCII characters plus escape sequences such as "ESC $ B".
When the byte stream contains "ESC $ B" or "ESC $ @", the bytes that
follow are JIS X 0208 characters (a 94x94 set, so two bytes per
character) until "ESC ( B" or "ESC ( J" switches back to US-ASCII
(or JIS X 0201 Roman).
The same byte value therefore means different things depending on the
current mode, so we can't name a fixed set of bytes that make up words
in the iso-2022-jp encoding.
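
As a rough illustration of the mode switching described above, here is
a minimal Python sketch that splits a byte string into runs by the
charset currently in effect. The function name split_runs and the
charset labels are made up for this example; only the four escape
sequences come from RFC 1468, and the sketch assumes well-formed input
that starts in US-ASCII.

```python
ESC = 0x1B

# The four escape sequences of RFC 1468 -> (label, bytes per character).
# The labels are illustrative names chosen for this sketch.
DESIGNATIONS = {
    b"\x1b(B": ("ascii", 1),
    b"\x1b(J": ("jisx0201-roman", 1),
    b"\x1b$@": ("jisx0208-1978", 2),
    b"\x1b$B": ("jisx0208-1983", 2),
}


def split_runs(data: bytes):
    """Return a list of (charset label, raw bytes) runs, escapes removed."""
    runs = []
    charset, width = "ascii", 1  # iso-2022-jp text starts in US-ASCII
    start = 0                    # where the current run begins
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError("unknown escape sequence at offset %d" % i)
            if i > start:
                runs.append((charset, data[start:i]))
            charset, width = DESIGNATIONS[seq]
            i += 3
            start = i
        else:
            # Step over one character: 1 byte in ASCII/Roman, 2 in JIS X 0208.
            i += width
    if start < len(data):
        runs.append((charset, data[start:]))
    return runs


if __name__ == "__main__":
    for label, raw in split_runs(b"abc\x1b$B!!F|\x1b(Bxyz"):
        print(label, raw)
```

Note that the byte 0x21 is "!" in the ASCII runs but is half of a
two-byte character in the JIS X 0208 run, which is why a plain list of
"word bytes" does not work.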

For more information, you can see RFC1468: Japanese Character Encoding 
for Internet Messages http://www.rfc-editor.org/rfc/rfc1468.txt

> For the dual-byte folks, I don't think this will work.  The upstream
> author is willing to work with you, but he's not sure how to do it.
> Actually it may work... if you put both bytes into the charset.
> Depends on what your whitespace looks like.

Whitespace in iso-2022-jp is 0x20 in the single-byte mode, and the
sequence "0x21 0x21" ("!!") when it appears between
"0x1B 0x24 0x42" ("ESC $ B") or "0x1B 0x24 0x40" ("ESC $ @")
and "0x1B 0x28 0x42" ("ESC ( B") or "0x1B 0x28 0x4A" ("ESC ( J").
However, the Japanese language doesn't use whitespace to separate words.
This is why we need a tool such as chasen - Japanese Morphological 
Analysis System.
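
To make the whitespace rule concrete, here is a small Python sketch
that reports the byte offsets of whitespace in an iso-2022-jp byte
string. The name whitespace_offsets is hypothetical, the input is
assumed to be well-formed, and only the two kinds of space mentioned
above are handled (no tabs or newlines). It does nothing about word
segmentation, which still needs a morphological analyzer.

```python
# Escape sequences that enter and leave the two-byte JIS X 0208 mode.
TWO_BYTE = (b"\x1b$@", b"\x1b$B")
ONE_BYTE = (b"\x1b(B", b"\x1b(J")


def whitespace_offsets(data: bytes):
    """Return offsets of 0x20 (one-byte mode) and 0x21 0x21 (two-byte mode)."""
    offsets = []
    two_byte = False  # text starts in US-ASCII
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in TWO_BYTE:
            two_byte = True
            i += 3
        elif seq in ONE_BYTE:
            two_byte = False
            i += 3
        elif two_byte:
            # Each character is two bytes; 0x21 0x21 is the JIS X 0208 space.
            if data[i:i + 2] == b"!!":
                offsets.append(i)
            i += 2
        else:
            if data[i] == 0x20:
                offsets.append(i)
            i += 1
    return offsets


if __name__ == "__main__":
    # The "!!" at offsets 3-4 is in ASCII mode, so it is not whitespace.
    print(whitespace_offsets(b"a b!!\x1b$B!!F|\x1b(B c"))
```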

Regards,
Fumitoshi UKAI
