Automatically generated ISO-8859-1 characters in multibyte webpages
Hi,
I found that the page at http://www.debian.org/devel/people.ja.html
is rendered very badly. ALL characters are displayed in boldface (i.e.,
<strong> format) after a certain point.
This occurs because of 8-bit (i.e., non-ASCII) characters in
developers' names. When such characters (I guess most of them are
intended to be ISO-8859-1) are used in developers' names, they
appear unchanged in the webpage. In multibyte encodings, 8-bit
bytes (0x80 - 0xff) are typically regarded as the first byte of a
multibyte character, and the following byte is regarded as the second
byte of THAT multibyte character. Imagine such an 8-bit character is
used at the end of a developer's name. The following character
is the "<" of "</strong>". That "<" will then be
regarded as the second byte of the multibyte character, and the "<"
itself will be lost. Thus, "</strong>" becomes "/strong>", a broken
tag.
This makes the webpage look very messy. Please view the page in
a few browsers: because of the broken "</strong>", everything that
follows is displayed in <strong> format!
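To make the mechanism concrete, here is a minimal Python sketch. It is only a simplified model of a naive two-byte decoder, not the behavior of any particular browser or of Python's own codecs; the function name naive_two_byte_decode and the sample name "Jos\xe9" are made up for illustration.

```python
def naive_two_byte_decode(data: bytes) -> str:
    """Simplified model (an assumption, not a real codec): every byte
    >= 0x80 is taken as the lead byte of a two-byte character and
    swallows whatever byte comes next."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Plain ASCII byte: pass it through unchanged.
            out.append(chr(data[i]))
            i += 1
        else:
            # 8-bit byte: emit a placeholder for the (garbage) multibyte
            # character and skip the following byte as its "second byte".
            out.append("?")
            i += 2
    return "".join(out)


# Hypothetical entry: an ISO-8859-1 e-acute (0xe9) at the end of a name.
entry = b"<strong>Jos\xe9</strong>"
print(naive_two_byte_decode(entry))
```

In this model the "<" right after the name is consumed, so the closing tag comes out as "/strong>" and the boldface is never switched off.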
I imagine the solution would be one of the following:
1. Regard all 8-bit characters as ISO-8859-1 and replace them
with &foobar; expressions. For example, 0xfc will be
replaced with "&uuml;". The problem with this solution is that
we have to assume 8-bit characters are ISO-8859-1. This means
that this solution discourages developers from switching from
ISO-8859-1 to UTF-8, which is a very bad thing.
2. Require all developers to use ASCII or UTF-8 in their names, and
have the script that generates people.name assume all 8-bit characters
are UTF-8. All other encodings such as ISO-8859-1 or EUC-JP will
be forbidden.
The problem with this solution is that ISO-8859-1(15) people will complain.
However, IMHO, this is an unfair privilege of ISO-8859-1(15) people,
and moreover, such an unequal situation hinders the promotion of i18n.
Anyway, it will take a huge amount of energy to persuade ISO-8859-1(15) people.
3. Without forcing developers to switch to UTF-8, have the script
that generates people.name regard all 8-bit characters as
UTF-8. Since few 8-bit characters in developers' names are UTF-8
so far, most non-ASCII characters in people.html will be lost.
(Anyway, all non-ASCII characters ARE already lost in the
people.<multibyte languages>.html pages.)
However, the broken-tag problem will be solved. If developers
switch to UTF-8, their names will be displayed correctly.
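For comparison, options 1 and 3 above could be sketched roughly as follows in Python. This is only an illustration under my own assumptions, not the actual generation script: the function names latin1_to_entities and utf8_or_drop are hypothetical, and emitting numeric character references for valid UTF-8 input is just one way to keep the output pure ASCII.

```python
from html.entities import codepoint2name


def latin1_to_entities(raw: bytes) -> str:
    """Option 1 (sketch): assume every 8-bit byte is ISO-8859-1 and
    replace it with a named entity, or a numeric one if no name exists."""
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))
        elif b in codepoint2name:
            out.append(f"&{codepoint2name[b]};")
        else:
            out.append(f"&#{b};")
    return "".join(out)


def utf8_or_drop(raw: bytes) -> str:
    """Option 3 (sketch): assume UTF-8, silently drop bytes that are not
    valid UTF-8, and emit remaining non-ASCII as numeric references."""
    text = raw.decode("utf-8", errors="ignore")
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(latin1_to_entities(b"M\xfcller"))            # ISO-8859-1 input
print(utf8_or_drop(b"M\xfcller"))                  # ISO-8859-1 input, byte lost
print(utf8_or_drop("M\u00fcller".encode("utf-8")))  # UTF-8 input survives
```

Either way the output contains only ASCII bytes, so a following "<" can no longer be swallowed. Option 2 would use the same UTF-8 path as option 3, with the difference being policy rather than code.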
---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/