lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in multibyte webpages)
Hi,
From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Subject: Re: automatically-generated ISO-8859-1 characters in multibyte webpages
Date: Fri, 03 Jan 2003 09:06:43 +0900 (JST)
> BTW, I found similar trouble in lists.debian.org pages. In thread-list
> pages or date-list pages like
>
> http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html,
>
> there is no charset specification. In such cases, web browsers
> interpret the pages according to user preference. Naturally, Japanese
> people configure web browsers to "assume Japanese encoding for pages
> without charset specification". On the other hand, the thread-list
> pages show senders' names in <em> format, and therefore a </em> tag
> follows each name. If the last character of the name is 8-bit, the
> browser reads it as the first byte of a double-byte character and
> swallows the "<" of the tag, so the tag is broken. As a result,
> everything that follows is shown in <em> (italic) format.
>
> The test is easy: please configure your browser to "assume Japanese
> encoding for pages without charset specification" and load the above
> page.
>
>
> However, in this case, the solution is a bit complicated. All mails
> should carry encoding information in MIME format. Thus, the best
> solution would be to parse MIME. On the other hand, the simplest
> makeshift solution is to add "charset=iso-8859-1" to all pages,
> but there are mailing lists where most of the 8-bit characters are
> Cyrillic and so on.
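To make the breakage described above concrete, here is a minimal Python sketch. It is only a simplified, hypothetical model of a lenient double-byte decoder (the function name and the "any byte >= 0x80 swallows the next byte" rule are my assumptions, not the code of any real browser), but it shows how a name ending in an 8-bit character can eat the "<" of the closing tag.

```python
def lenient_double_byte_view(data):
    """Render bytes the way a lenient double-byte decoder might see them.

    Simplified, hypothetical model: any byte >= 0x80 is treated as a lead
    byte that swallows the byte after it; the pair is shown as "?".
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80 and i + 1 < len(data):
            out.append("?")  # lead byte plus the following byte form one character
            i += 2
        else:
            # plain ASCII byte, or a dangling 8-bit byte at the very end
            out.append(chr(data[i]) if data[i] < 0x80 else "?")
            i += 1
    return "".join(out)


# A sender name ending in an 8-bit ISO-8859-1 character, marked up as in
# the thread index ("Jos\u00e9" is "Jose" with an acute accent on the e).
page_bytes = "<em>Jos\u00e9</em>".encode("iso-8859-1")
seen = lenient_double_byte_view(page_bytes)
print(seen)  # <em>Jos?/em>
```

The closing tag no longer appears in what the decoder sees, so the emphasis is never closed and runs on through the rest of the page.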
I found that MHonArc has a feature to solve this problem.
http://www.mhonarc.org/MHonArc/doc/faq/mime.html#nonascii
I checked /org/lists.debian.org/mhonarc/debian.rc and found
that it seems to assume that all 8-bit characters are ISO-8859-1.
> <CharsetConverters>
> plain; mhonarc::htmlize;
> us-ascii; mhonarc::htmlize;
> iso-8859-1; mhonarc::htmlize;
> iso-8859-2; iso_8859::str2sgml; iso8859.pl
> iso-8859-3; iso_8859::str2sgml; iso8859.pl
Why not use iso_8859::str2sgml instead of mhonarc::htmlize for iso-8859-1?
(Though I am new to MHonArc, I imagine that iso_8859::str2sgml converts
ISO-8859 8-bit characters into SGML entities like "&ouml;".)
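As an illustration of the kind of conversion I have in mind, here is a minimal Python sketch. It is not MHonArc's implementation (MHonArc is written in Perl, and I have not read iso8859.pl); the function name latin1_to_entities is hypothetical, and the entity names come from Python's standard html.entities table.

```python
from html.entities import codepoint2name


def latin1_to_entities(raw):
    """Turn ISO-8859-1 bytes into pure-ASCII text with SGML/HTML entities.

    Hypothetical helper for illustration only. In ISO-8859-1 each byte
    value equals its Unicode code point, so the byte can be looked up
    directly in the entity table.
    """
    out = []
    for byte in raw:
        if byte >= 0x80 or byte in (0x26, 0x3C, 0x3E):  # 8-bit, or & < >
            name = codepoint2name.get(byte)
            # fall back to a numeric reference when no named entity exists
            out.append("&%s;" % name if name else "&#%d;" % byte)
        else:
            out.append(chr(byte))
    return "".join(out)


converted = latin1_to_entities(b"J\xf6rg <joerg@example.org>")
print(converted)  # J&ouml;rg &lt;joerg@example.org&gt;
```

Because the result contains only ASCII, a browser shows the same characters whatever encoding it assumes for pages without a charset specification.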
It would be nice if we could convert raw 8-bit mail headers (they are
illegal, but they sometimes occur and can break the lists.debian.org
pages) to SGML entities by assuming they are ISO-8859-1. Since this may
annoy Russian (and other non-ISO-8859-1) people who happen to use MUAs
that generate illegal mail headers with 8-bit characters and no charset
specification, I'd like to hear from people from various countries.
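The concern for non-ISO-8859-1 users can be sketched in Python as well. The example below assumes a sender whose MUA emits a raw KOI8-R header (the name is just an example) and shows what happens when those bytes are interpreted as ISO-8859-1.

```python
# A Cyrillic name ("Ivan" in Russian) as a raw, unlabeled KOI8-R header.
name = "\u0418\u0432\u0430\u043d"
raw_header = raw_header_bytes = name.encode("koi8_r")

# Assuming ISO-8859-1 maps every byte to a Latin-1 character instead.
misread = raw_header_bytes.decode("iso-8859-1")
print(misread)  # accented Latin letters and symbols, not Cyrillic

# The bytes themselves are intact: a reader who knows the real charset
# can still recover the name from the raw header.
recovered = misread.encode("iso-8859-1").decode("koi8_r")
```

Once such bytes are rewritten as ISO-8859-1 entities in the generated page, the reader can no longer fix the display by switching the browser encoding, which is the annoyance I am worried about.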
---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/