
Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP endoding.



Hi,

From: Hideki Yamane <henrich@iijmio-mail.jp>
Subject: Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP endoding.
Date: Sun, 25 Jan 2004 01:10:41 +0900

>  * tag is OK. That says "content="text/html; charset=ISO-2022-JP"".
>  * It looks like contents is not valid ISO-2022-JP. I don't know why.
>    Frank, would you tell me the way how did you convert it from EUC-JP
>    to ISO-2022-JP ? 

I checked http://packages.debian.org/unstable/misc/language-env.ja.html
and found that closing escape sequences are missing.


ISO-2022-JP is a "stateful" encoding.  A string consists of escape
sequences that select the "state" and ordinary codes whose meaning
(the corresponding characters) depends on the current "state".

For example, <Japanese Hiragana A> is:

    1B 24 42 24 22 1B 28 42

where 1B 24 42 (the first three bytes) means "here starts JIS X 0208
Japanese", 24 22 (the following two bytes) is Japanese Hiragana A, and
the final 1B 28 42 means "here starts ASCII".  In the Japanese state,
24 22 means Japanese Hiragana A, while in the ASCII state it means a
Dollar Sign followed by a Double Quotation Mark.
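
The byte-level reading above can be checked with a short Python sketch.
It assumes Python's built-in "iso2022_jp" codec; the variable names are
illustrative only.

```python
# Byte sequence quoted above for Japanese Hiragana A (U+3042).
raw = bytes.fromhex("1B 24 42 24 22 1B 28 42")

start_jis = raw[:3]      # ESC $ B : "here starts JIS X 0208 Japanese"
payload = raw[3:5]       # 24 22   : Hiragana A while in the Japanese state
back_to_ascii = raw[5:]  # ESC ( B : "here starts ASCII"

# Decoding the whole sequence honours the escape sequences.
decoded = raw.decode("iso2022_jp")

# The same two payload bytes, read in the ASCII state instead.
ascii_reading = payload.decode("ascii")

# Encoding Hiragana A again yields the same bytes, closing escape included.
encoded = "\u3042".encode("iso2022_jp")

print(decoded, ascii_reading, encoded.hex(" "))
```

The same two bytes give different characters depending only on which
escape sequence came before them.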


I said closing escape sequences are missing.  This means the "here starts
ASCII" part is missing.  Thus, all of the following ASCII characters
(including HTML tags) are interpreted as Japanese, which causes Mojibake.
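
One way to detect this kind of breakage is to track the state through
the byte stream and see which state is active at the end.  The following
is only a sketch: the function name final_state and the state labels are
made up for illustration, it handles just the four escape sequences that
RFC 1468 defines for ISO-2022-JP, and it says nothing about how the
actual page was generated.

```python
# Escape sequences defined for ISO-2022-JP in RFC 1468.
# The state labels on the right are illustrative names only.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jis-roman",
    b"\x1b$@": "jis-x-0208",
    b"\x1b$B": "jis-x-0208",
}


def final_state(data: bytes) -> str:
    """Return the state that is active after the last byte of data."""
    state = "ascii"  # ISO-2022-JP text starts in the ASCII state
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in ESCAPES:
            state = ESCAPES[seq]
            i += 3
        else:
            # ESC (0x1B) never occurs inside a character code, so
            # stepping one byte at a time cannot skip an escape sequence.
            i += 1
    return state


# Hiragana A followed by an HTML tag, with and without the closing escape.
good = bytes.fromhex("1B 24 42 24 22 1B 28 42") + b"</p>"
broken = bytes.fromhex("1B 24 42 24 22") + b"</p>"

print(final_state(good))    # state after a properly closed string
print(final_state(broken))  # state when the closing escape is missing
```

In the broken case the tag bytes 3C 2F 70 3E are still in the Japanese
state, so a decoder pairs them up as two-byte codes instead of reading
them as "</p>".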

I don't know what algorithm is used to generate the page, so I have
no idea why the page is broken.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/


