Re: Translators! What's your charset?
At Wed, 11 Oct 2000 14:33:30 +1100,
Craig Small <csmall@eye-net.com.au> wrote:
> I cannot do the following ones until I get a charset:
> jp - iso-2022-jp
That should be `ja' if you mean the language code; `jp' is the country code.
Anyway, iso-2022-jp is not so simple.
It uses US-ASCII characters plus escape sequences such as "ESC $ B".
When the byte stream encounters "ESC $ B" or "ESC $ @", it switches
to JIS X 0208 characters, a 94x94 set in which each character takes
2 bytes. That lasts until "ESC ( B" or "ESC ( J", which switches
back to US-ASCII (or JIS X 0201 Roman).
So we can't say which set of bytes makes up words in the iso-2022-jp
encoding.
For more information, see RFC 1468, "Japanese Character Encoding
for Internet Messages": http://www.rfc-editor.org/rfc/rfc1468.txt
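
To make the mode switching concrete, here is a minimal sketch in Python
that splits an iso-2022-jp byte string into runs, one per character set.
The function name split_segments and the charset labels are made up for
this illustration; only the four escape sequences come from RFC 1468.

```python
# Escape sequences from RFC 1468 and the character set each one selects.
# The label strings are hypothetical names chosen for this sketch.
ESCAPES = {
    b"\x1b(B": "ascii",           # ESC ( B
    b"\x1b(J": "jisx0201-roman",  # ESC ( J
    b"\x1b$@": "jisx0208",        # ESC $ @
    b"\x1b$B": "jisx0208",        # ESC $ B
}


def split_segments(data: bytes):
    """Return a list of (charset, bytes) runs; text starts in US-ASCII."""
    segments = []
    charset = "ascii"
    run = bytearray()
    i = 0
    while i < len(data):
        esc = data[i:i + 3]
        if esc in ESCAPES:
            # Close the current run and switch character sets.
            if run:
                segments.append((charset, bytes(run)))
                run = bytearray()
            charset = ESCAPES[esc]
            i += 3
        else:
            run.append(data[i])
            i += 1
    if run:
        segments.append((charset, bytes(run)))
    return segments


if __name__ == "__main__":
    sample = "abc\u65e5\u672c\u8a9edef".encode("iso2022_jp")
    for name, chunk in split_segments(sample):
        print(name, chunk)
```

The same byte value means different things depending on which run it
falls in, which is why a flat list of "word bytes" cannot describe this
encoding.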
> For the dual-byte folks, I don't think this will work. The upstream
> author is willing to work with you, but he's not sure how to do it.
> Actually it may work... if you put both bytes into the charset.
> Depends on what your whitespace looks like.
Whitespace in iso-2022-jp is 0x20, plus the sequence "0x21 0x21" ("!!")
when it appears between "0x1B 0x24 0x42" ("ESC $ B") or
"0x1B 0x24 0x40" ("ESC $ @") and "0x1B 0x28 0x42" ("ESC ( B") or
"0x1B 0x28 0x4A" ("ESC ( J").
However, the Japanese language doesn't use whitespace to separate words.
This is why we need a tool such as chasen - Japanese Morphological
Analysis System.
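
For what it's worth, the whitespace rule above can be sketched in Python
as follows. The function name whitespace_offsets is hypothetical, and the
sketch assumes well-formed input that uses only the four escape sequences
mentioned in this mail. It finds whitespace bytes only; it does not find
word boundaries in Japanese text.

```python
ONE_BYTE = (b"\x1b(B", b"\x1b(J")  # ESC ( B, ESC ( J: back to one-byte mode
TWO_BYTE = (b"\x1b$@", b"\x1b$B")  # ESC $ @, ESC $ B: JIS X 0208 mode


def whitespace_offsets(data: bytes):
    """Return byte offsets of whitespace in an iso-2022-jp byte string.

    0x20 counts in one-byte mode; the pair 0x21 0x21 counts in two-byte
    mode. "!!" in one-byte mode is ordinary punctuation and is ignored.
    """
    offsets = []
    two_byte = False  # text starts in US-ASCII
    i = 0
    while i < len(data):
        esc = data[i:i + 3]
        if esc in ONE_BYTE:
            two_byte = False
            i += 3
        elif esc in TWO_BYTE:
            two_byte = True
            i += 3
        elif two_byte:
            # Step by whole characters so pairs stay aligned.
            if data[i:i + 2] == b"!!":
                offsets.append(i)
            i += 2
        else:
            if data[i] == 0x20:
                offsets.append(i)
            i += 1
    return offsets


if __name__ == "__main__":
    print(whitespace_offsets(b"a b\x1b$B!!\x1b(B!!"))
```

Stepping two bytes at a time in JIS X 0208 mode matters: otherwise the
second byte of one character and the first byte of the next could be
misread as "!!".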
Regards,
Fumitoshi UKAI