Re: lists.debian.org de-localization

Tomohiro KUBOTA <debian@tmail.plala.or.jp>:

> The key point is that when we receive a mail with raw 8bit characters,
> we don't have an easy and reliable method to tell whether the characters
> are from ISO-8859-1 or KOI8-R or other character sets.

If the headers contain 8-bit octets and are valid as UTF-8, it's
fairly safe to assume that they really are UTF-8. Otherwise, you could
look for a charset in the Content-Type field, or fall back to a default
charset chosen per mailing list.
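
As a rough sketch of that heuristic (Python, untested against real list
traffic): the function name guess_header_charset and the parameters
declared and list_default are made up for illustration, where declared
stands for whatever charset the Content-Type field gave, if any, and
list_default for a per-list fallback.

```python
from typing import Optional


def guess_header_charset(raw: bytes,
                         declared: Optional[str] = None,
                         list_default: str = "iso-8859-1") -> str:
    """Guess the charset of raw header octets (hypothetical helper)."""
    # No 8-bit octets: the heuristic does not apply. Note that 7-bit
    # encodings such as iso-2022-jp are not distinguished here.
    if all(b < 0x80 for b in raw):
        return "us-ascii"
    # 8-bit octets that form valid UTF-8 are very unlikely to be
    # anything else, so assume UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise use the Content-Type charset if there was one,
    # else the default assumed for this mailing list.
    return declared or list_default
```

For example, KOI8-R or ISO-8859-1 text almost never happens to be valid
UTF-8, so it falls through to the declared or per-list charset.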

> An easy way is to assume *all* raw 8bit characters to be KOI8-R and
> convert them into SGML entities.  However, I don't know whether there
> are some other languages where a certain amount of non-spammer people
> use raw 8bit characters.  If they exist, they will complain about this
> idea.

I thought some Japanese non-spammers use iso-2022-jp in headers, which
isn't 8-bit, but it isn't us-ascii, either. Am I out of date?
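
To illustrate what I mean by "not 8-bit, but not us-ascii either", here
is a small Python check. The sample string is just an example I picked;
"iso2022_jp" is the name of Python's standard codec for this encoding.

```python
text = "日本語"  # sample string, chosen only for illustration
encoded = text.encode("iso2022_jp")

# Every octet has the high bit clear, so a test for "raw 8bit
# characters" would not notice this header at all.
is_seven_bit = all(b < 0x80 for b in encoded)

# The text is still not plain us-ascii: the character set is switched
# with ESC (0x1b) sequences, and the octets in between are not meant
# to be read as ASCII letters.
has_escape = b"\x1b" in encoded

print(is_seven_bit, has_escape)
```

So a rule that only looks for octets above 0x7f would pass such headers
through untouched.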


