
Re: automatically-generated ISO-8859-1 characters in multibyte webpages



Hi,

From: barbier@linuxfr.org (Denis Barbier)
Subject: Re: automatically-generated ISO-8859-1 characters in multibyte webpages
Date: Thu, 2 Jan 2003 16:24:59 +0100

> I find only 18 names in people.names containing non-ASCII letters,
> so /org/www.debian.org/cron/people_scripts/people.pl could contain
> some extra elsif in its canonical_names function to replace
> non-ASCII letters by HTML entities.  Most names seem to be ISO-8859-1
> encoded.
> When done, this script could also skip maintainers with non-ASCII
> letters which are not processed in order to prevent future trouble.

I think the simplest filter (assuming ISO-8859-1) would look like the
following:

  s/\xa0/&nbsp;/g;
  s/\xa1/&iexcl;/g;
  s/\xa2/&cent;/g;
    :
    :
  s/\xff/&yuml;/g;

However, as you said, it may cause future trouble.
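
The substitution table above can also be generated rather than typed
out.  Here is a minimal sketch, written in Python rather than the Perl
that people.pl uses, and assuming the input really is ISO-8859-1.  The
function name latin1_to_entities is hypothetical, and the "?"
placeholder for bytes 0x80-0x9f (which have no Latin-1 entity name) is
my assumption, not something the script does today:

```python
from html.entities import codepoint2name


def latin1_to_entities(raw: bytes) -> str:
    """Replace every 8-bit byte of an ISO-8859-1 string with an HTML entity."""
    out = []
    for byte in raw:
        if byte < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(chr(byte))
        elif byte in codepoint2name:
            # 0xa0-0xff all have named entities (&nbsp; ... &yuml;).
            out.append("&%s;" % codepoint2name[byte])
        else:
            # 0x80-0x9f: no named entity; "?" is an assumed placeholder.
            out.append("?")
    return "".join(out)


print(latin1_to_entities(b"Ren\xe9 M\xfcller"))  # Ren&eacute; M&uuml;ller
```

Like the s/// list, this does nothing useful for names that are not
ISO-8859-1 encoded.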

I think it is also a good idea to simply skip (remove) non-ASCII
characters, as you said, because that is very simple to implement.
Once this stops the tags from breaking, we will have enough time to
think about a UTF-8 filter.
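
To show how small the "skip" approach is, here is a rough sketch in
Python (not the Perl of people.pl).  Both function names are
hypothetical.  The first drops the 8-bit bytes from a name, the second
tells the caller whether a maintainer entry should be skipped
altogether:

```python
def strip_non_ascii(raw: bytes) -> bytes:
    """Return the name with every byte >= 0x80 removed."""
    return bytes(byte for byte in raw if byte < 0x80)


def has_non_ascii(raw: bytes) -> bool:
    """True if the name contains at least one 8-bit byte."""
    return any(byte >= 0x80 for byte in raw)


name = b"Ren\xe9"
if has_non_ascii(name):
    # Either skip the entry entirely or fall back to the stripped form.
    name = strip_non_ascii(name)
print(name)
```

Of course the stripped name is wrong, so this is only a stopgap.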


BTW, I found similar trouble in lists.debian.org pages.  In thread-list
pages or date-list pages like

  http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html,

there is no charset specification.  In such cases, web browsers
choose an encoding according to the user's preferences.  Naturally,
Japanese users configure their browsers to "assume Japanese encoding
for pages without charset specification".  Meanwhile, the thread-list
pages show senders' names in <em> format, and therefore an </em> tag
follows each name.  If the last byte of the name is 8-bit, the browser
can take it as the first byte of a multibyte character and swallow the
"<" that follows, so the tag is broken.  The result is that everything
after that point is shown in <em> (italic) format.
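
The affected spots can be found mechanically.  The sketch below
(Python, with the hypothetical name risky_offsets) assumes that the
danger is exactly the case described above, an 8-bit byte directly
followed by "<", and reports the offset of each such byte in the raw
page.  It does not try to emulate any particular browser's decoder:

```python
import re

# An 8-bit byte immediately followed by the "<" that opens a tag.
RISKY = re.compile(rb"[\x80-\xff](?=<)")


def risky_offsets(page: bytes) -> list:
    """Return the offsets of 8-bit bytes that directly precede a tag."""
    return [match.start() for match in RISKY.finditer(page)]


print(risky_offsets(b"<em>Ren\xe9</em>"))  # [7]
print(risky_offsets(b"<em>Rene</em>"))     # []
```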

The test is easy: please configure your browser to "assume Japanese
encoding for pages without charset specification" and load the above
page.


However, in this case the solution is a bit more complicated.  All
mails should carry encoding information in their MIME headers, so the
best solution would be to parse MIME.  The simplest makeshift solution,
on the other hand, is to add "charset=iso-8859-1" to all pages, but
there are mailing lists where most of the 8-bit characters are
Cyrillic and so on.
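
As a rough illustration of the MIME approach, the following Python
sketch decodes a sender name that uses MIME encoded-words (RFC 2047)
and writes every non-ASCII character as a numeric character reference,
so the result no longer depends on the charset of the page.  The
function name sender_to_html is hypothetical, I do not know how the
list archive software actually builds these pages, and mails whose
headers contain raw, undeclared 8-bit bytes are not handled:

```python
import html
from email.header import decode_header, make_header


def sender_to_html(raw_header: str) -> str:
    """Decode RFC 2047 encoded-words and return pure-ASCII HTML text."""
    # Turn "=?charset?q?...?=" pieces into a Unicode string.
    text = str(make_header(decode_header(raw_header)))
    # Escape &, <, > and quotes, then replace non-ASCII characters
    # with &#NNN; references.
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(sender_to_html("=?iso-8859-1?q?Ren=E9?="))  # Ren&#233;
print(sender_to_html("=?utf-8?q?=D0=9F?="))       # &#1055;
```

Because the charset comes from each mail, the same code covers the
Cyrillic lists as well as the ISO-8859-1 ones.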

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



