
lists.debian.org de-localization (Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages)



Hi,

From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Fri, 03 Jan 2003 09:06:43 +0900 (JST)

> BTW, I found a similar problem in the lists.debian.org pages.  Thread-list
> and date-list pages such as
> 
>   http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html
> 
> have no charset specification.  In such cases, web browsers choose an
> encoding for the page according to the user's preferences.  Naturally,
> Japanese people configure web browsers to "assume Japanese encoding for
> pages without charset specification".  On the other hand, the thread-list
> pages show senders' names in <em> format, and therefore a </em> tag
> follows each name.  If the last letter of the name is 8bit, the browser
> takes it as the first byte of a multibyte character and swallows the "<"
> that follows, so the tag is broken.  The result is that everything after
> it is shown in <em> (italic) format.
>
> The test is easy: please configure your browser to "assume Japanese
> encoding for pages without charset specification" and load the above
> page.
>
>
> However, in this case, the solution is a bit complicated.  All mails
> should carry encoding information in MIME format.  Thus, the best
> solution would be to parse MIME.  On the other hand, the simplest
> makeshift solution is to add "charset=iso8859-1" to all pages,
> but there are mailing lists where most of the 8bit characters are
> Cyrillic and so on.
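
To make the "parse MIME" idea above concrete, here is a minimal sketch in
Python.  It is not how lists.debian.org or MHonArc works; it only assumes
that the sender's name arrives as an RFC 2047 encoded word, and the function
name header_to_ascii_html is made up for this illustration.  The decoded
name is written back as numeric character references, so the page stays pure
ASCII whatever encoding the browser assumes.

```python
from email.header import decode_header, make_header
from html import escape


def header_to_ascii_html(raw_header: str) -> str:
    """Decode an RFC 2047 header value and return ASCII-only HTML text.

    Hypothetical helper for illustration only.
    """
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() joins them back into one Unicode string.
    decoded = str(make_header(decode_header(raw_header)))
    # Escape markup characters, then turn every non-ASCII character
    # into a numeric character reference such as "&#246;".
    return escape(decoded).encode("ascii", "xmlcharrefreplace").decode("ascii")


print(header_to_ascii_html("=?ISO-8859-1?Q?J=F6rg_M=FCller?="))
# prints: J&#246;rg M&#252;ller
```

Because the charset comes from the header itself, the same function would
handle KOI8-R or ISO-2022-JP names without any guessing.  It does nothing
for raw 8bit headers that carry no charset at all.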


I found that MHonArc has a feature to solve this problem.

      http://www.mhonarc.org/MHonArc/doc/faq/mime.html#nonascii

I checked /org/lists.debian.org/mhonarc/debian.rc and found
that it seems to assume that all 8bit characters are ISO-8859-1.

> <CharsetConverters>
> plain;          mhonarc::htmlize;
> us-ascii;       mhonarc::htmlize;
> iso-8859-1;     mhonarc::htmlize;
> iso-8859-2;     iso_8859::str2sgml;     iso8859.pl
> iso-8859-3;     iso_8859::str2sgml;     iso8859.pl

Why not use iso_8859::str2sgml instead of mhonarc::htmlize for iso-8859-1?

(Though I am new to MHonArc, I imagine that iso_8859::str2sgml converts
ISO-8859 8bit characters into SGML entities such as "&ouml;".)
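
As an illustration of the kind of conversion I have in mind, here is a
small Python sketch.  It is not MHonArc's code and may differ from what
iso_8859::str2sgml really does; the function name latin1_to_entities is
made up, and the entity names come from Python's standard
html.entities table.

```python
from html.entities import codepoint2name

# Markup characters that must be escaped in any case.
_MARKUP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def latin1_to_entities(raw: bytes) -> str:
    """Convert bytes assumed to be ISO-8859-1 into ASCII-only HTML text.

    Hypothetical helper for illustration only.
    """
    out = []
    for byte in raw:
        if byte < 0x80:
            ch = chr(byte)
            out.append(_MARKUP.get(ch, ch))
        else:
            # In ISO-8859-1 the byte value equals the Unicode code point.
            name = codepoint2name.get(byte)
            # Use a named entity if one exists, else a numeric reference.
            out.append(f"&{name};" if name else f"&#{byte};")
    return "".join(out)


print(latin1_to_entities(b"J\xf6rg"))  # prints: J&ouml;rg
```

The output contains no 8bit bytes, so a following </em> tag cannot be
damaged even if the browser assumes a Japanese encoding.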

It would be nice if we could convert raw 8bit mail headers to SGML
entities by assuming they are ISO-8859-1.  (Such headers are illegal,
but they sometimes occur and can break the lists.debian.org pages.)
Since this may annoy Russian (and other non-ISO-8859-1) people who
happen to use MUAs which generate illegal mail headers with 8bit
characters and no charset specification, I'd like to hear from people
from various countries.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



