
automatically-generated ISO-8859-1 characters in multibyte webpages


I found that the page http://www.debian.org/devel/people.ja.html
looks very messy.  ALL characters after a certain point are
displayed in boldface (i.e., <strong> format).

This occurs because of 8bit (i.e., non-ASCII) characters in
developers' names.  When such characters (I guess most of them are
intended to be ISO-8859-1) are used in developers' names, they are
copied into the webpage as they are.  In multibyte encodings, 8bit
bytes (0x80 - 0xff) are regarded as the first byte of a multibyte
character, and the following byte is regarded as the second byte
of THAT multibyte character.  Imagine that such an 8bit character
is the last character of a developer's name.  The following
character is the "<" of "</strong>".  That "<" will then be
regarded as the second byte of a multibyte character, and the "<"
itself will be lost.  Thus, "</strong>" becomes "/strong>", a broken
tag.

This makes the webpage look very messy.  Please view the page in
a browser: because of the broken "</strong>", everything that
follows is displayed in <strong> format!
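
The effect can be reproduced with a small Python sketch.  The
decoder below is NOT a real browser or codec; it is a deliberately
naive model of the behaviour described above (every byte in
0x80 - 0xff swallows the byte after it), and the name "André" is
a made-up example.

```python
def naive_two_byte_decode(data: bytes) -> str:
    """Model of a naive multibyte decoder (an assumption, not a real codec).

    Every byte in 0x80-0xff is taken as the first byte of a two-byte
    character, so the byte that follows it is consumed as well.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            out.append("\ufffd")  # the byte pair shows up as one unknown character
            i += 2                # the following byte is swallowed
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


# Hypothetical entry whose name ends in an ISO-8859-1 byte (0xe9).
entry = "<strong>Andr\u00e9</strong> following text".encode("iso-8859-1")
decoded = naive_two_byte_decode(entry)
print(decoded)  # the "<" of "</strong>" is gone, so the tag never closes
```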

I imagine the solution would be one of the following:

1. Regard all 8bit characters as ISO-8859-1 and replace them
   with &foobar; expressions.  For example, 0xfc will be
   replaced with "&uuml;".  The problem with this solution is that
   we have to assume that 8bit characters are ISO-8859-1.  This
   means that this solution discourages developers from switching
   from ISO-8859-1 to UTF-8, which is a very bad thing.

2. Force all developers to use ASCII or UTF-8 in their names, and
   have the script that generates people.name assume that all 8bit
   characters are UTF-8.  All other encodings, such as ISO-8859-1
   or EUC-JP, will be forbidden.
   The problem with this solution is that ISO-8859-1(15) people will complain.
   However, IMHO, being able to use ISO-8859-1(15) is an unfair
   privilege of those people, and moreover, such an unequal situation
   hinders the promotion of i18n.
   Anyway, it will take a huge amount of energy to persuade
   ISO-8859-1(15) people.

3. Without forcing developers to switch to UTF-8, have the script
   that generates people.name regard all 8bit characters as
   UTF-8.  Since few of the 8bit characters in developers' names
   are UTF-8 so far, most non-ASCII characters in people.html will
   be lost.  (Anyway, all non-ASCII characters ARE already lost in
   the people.<multibyte languages>.html pages.)
   However, the broken-tag problem will be solved.  If developers
   switch to UTF-8, their names will be displayed correctly.
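
To make the three options concrete, here is a rough Python sketch
of what the name-handling step could look like under each of them.
The function names are invented for illustration and have nothing
to do with the actual script; html.entities.codepoint2name is part
of the Python standard library.

```python
from html.entities import codepoint2name


def latin1_to_entities(raw: bytes) -> str:
    """Option 1: assume ISO-8859-1 and emit &name; references."""
    out = []
    for ch in raw.decode("iso-8859-1"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                         # ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. 0xfc -> &uuml;
        else:
            out.append(f"&#{cp};")                 # no named entity: numeric form
    return "".join(out)


def is_ascii_or_utf8(raw: bytes) -> bool:
    """Option 2: accept a name only if it is ASCII or well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf8_drop_invalid(raw: bytes) -> str:
    """Option 3: treat input as UTF-8 and silently drop invalid bytes."""
    return raw.decode("utf-8", errors="ignore")


latin1_name = "M\u00fcller".encode("iso-8859-1")  # hypothetical name, byte 0xfc
utf8_name = "M\u00fcller".encode("utf-8")

print(latin1_to_entities(latin1_name))  # option 1 keeps the character
print(is_ascii_or_utf8(latin1_name))    # option 2 rejects this name
print(utf8_drop_invalid(latin1_name))   # option 3 loses the character
print(utf8_drop_invalid(utf8_name))     # but keeps real UTF-8 intact
```

With option 3 the ISO-8859-1 byte simply disappears from the name,
but no stray 8bit byte reaches the multibyte page, so the tag that
follows stays intact.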

Tomohiro KUBOTA <kubota@debian.org>
