
Unicode on Debian web (was: Re: questions on webwml/english/templete/debian/cdimage.wml)



Tomohiro KUBOTA:

> Note that browsers cannot be free from "state" even if they use Unicode.

Not entirely, but Unicode makes it a whole lot easier, at least.

> Thus, though it is true ISO-2022 is very complex, please note Unicode
> is not so simple.  If Unicode were less simpler than human natural
> languages, it means that Unicode has defects.

You are comparing apples and oranges here. The display issues you
describe with Unicode can occur with ISO-2022 as well. The real
difference is in the code that recognises characters in the byte
stream: in ISO-2022 it is very hard to re-sync if you get lost
(transmission failure or broken pages), whereas with UTF-8 it is very
easy.
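
As a rough illustration of the re-sync point, here is a minimal Python
sketch. The helper name next_utf8_boundary is made up for this example
and is not from any library. It relies on the fact that UTF-8
continuation bytes always have the bit pattern 10xxxxxx, so a reader
dropped into the middle of a stream can skip forward to the next
character start. The ISO-2022-JP half assumes Python's built-in
iso2022_jp codec and simply drops the leading three-byte escape
sequence to mimic a transmission failure.

```python
def next_utf8_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


text = "日本語テキスト"

# UTF-8: start reading one byte into the first character.
utf8 = text.encode("utf-8")
start = next_utf8_boundary(utf8, 1)
recovered = utf8[start:].decode("utf-8")  # only the damaged character is lost

# ISO-2022-JP: lose the escape sequence that selects JIS X 0208.
iso = text.encode("iso2022_jp")
garbled = iso[3:].decode("iso2022_jp")  # the remaining bytes are read as ASCII

print(recovered)
print(garbled)
```

In the UTF-8 case everything after the first character survives. In
the ISO-2022-JP case the decoder has no way of knowing that a
double-byte set was active, so the rest of the text comes out as
unrelated ASCII characters.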

> Never.  Before appearance of Unicode, these encodings were identical,

Well, there is an (almost) 1-to-1 mapping between Shift-JIS,
ISO-2022-JP and EUC-JP, but all of them rely on partly undefined parts
of the JIS X 0208 standard, and on the fonts that display the text
having exactly the 0208 extensions that the text uses. You have the IBM
extensions, the NEC extensions and some other, incompatible,
extensions. For your text to be transmitted correctly you needed to
make sure your recipient not only understood your encoding (Shift-JIS)
but also had a 0208 font with the correct extensions. This causes a lot
of problems.
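
For characters that are inside standard JIS X 0208, the mapping between
the three encodings can be sketched in Python as below. This assumes
Python's built-in shift_jis, euc_jp and iso2022_jp codecs. The
transcode helper is a hypothetical name, and it goes through Unicode
only because that is how Python's codecs work, not because the
encodings need it.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two encodings, using Unicode as the pivot."""
    return data.decode(src).encode(dst)


text = "日本語"  # all three characters are in standard JIS X 0208

sjis = text.encode("shift_jis")
euc = transcode(sjis, "shift_jis", "euc_jp")
jis = transcode(euc, "euc_jp", "iso2022_jp")
back = transcode(jis, "iso2022_jp", "shift_jis")

print(sjis, euc, jis)
print(back == sjis)  # the round trip is lossless for standard characters
```

The round trip only works this cleanly because no vendor extension is
involved. As soon as the text uses one, the result depends on which
extension set each side assumes.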

> For example, Shift_JIS and CP932 is identical if we don't think about
> conversion to/from Unicode.

You're comparing apples and oranges again. Shift-JIS is an encoding of
JIS X 0201 and JIS X 0208, which has been (mis)used to encode vendor
extensions that are not necessarily compatible between computers.
CP932 defines another character set (which is based on JIS X 0201 and
JIS X 0208, but is not identical to them), so there you do not have the
problem, because the extensions are already predefined.
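
The difference can be seen with Python's built-in codecs, used here
purely as an illustration: its shift_jis codec covers only the
standard sets, while its cp932 codec also defines the vendor
extensions. The try_decode helper is a made-up name, and the byte pair
0x87 0x40 is the circled digit one from the NEC extension area.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Decode data, or return None if the bytes are undefined in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


standard = b"\x93\xfa"       # a kanji inside JIS X 0208
nec_extension = b"\x87\x40"  # circled digit one, an NEC extension

for name in ("shift_jis", "cp932"):
    print(name, try_decode(standard, name), try_decode(nec_extension, name))
```

The standard character decodes identically under both names, while the
extension character is defined only in CP932.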

> Most Japanese people even don't know the name of "CP932" and they
> think they are using Shift_JIS.

This is very similar to the western European problem where people say
that they are using ISO 8859-1, but are using the Windows extensions in
CP1252.
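
A small Python sketch of that western European case, assuming the
built-in cp1252 and iso8859_1 codecs. The sample bytes are invented for
the example. Bytes 0x80 to 0x9F are printable characters in CP1252 but
C1 control codes in ISO 8859-1.

```python
data = b"\x93quoted\x94 costs \x80 5"

as_cp1252 = data.decode("cp1252")     # curly quotes and a euro sign
as_latin1 = data.decode("iso8859_1")  # same bytes become C1 control codes

print(repr(as_cp1252))
print(repr(as_latin1))
```

Both decodes succeed, which is exactly why the mislabelling goes
unnoticed until the text is shown on a system that takes the
ISO 8859-1 label literally.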

The problem is that since "Shift-JIS" no longer means just an encoding
of 0201 and 0208, there is no guarantee which underlying character set
it is encoding. That makes it *very* hard to figure out what to convert
it to, and if you convert it to another character encoding (except for
EUC-JP and ISO-2022-JP, of course, since they can also be used to
encode this "unspecified" character set) you get into trouble. This is
not a problem of Unicode, because Unicode is well-defined, but of the
variants of Shift-JIS in use, because they are very badly defined.

Even if you add all the extensions that are documented anywhere to your
JIS X 0208 there is always "one more character" in use somewhere, just
because a popular Japanese font manufacturer added it to their fonts
and claimed it to be a 0208 font. Yes, I have seen these problems, and
they are a real mess to implement.

This is why I would like to move to Unicode, because it is clearly
defined what it contains.


If you want some background on why I know so much about Japanese
encodings (and Chinese, etc.), consider that I have been working with
Opera and its Unicode adaptation for over a year, a year that coincided
with our delivery of Opera for QNX for IBM, a delivery that was
targeted at Japan and China. So I have had my share of headaches with
this. Fortunately, we decided from the start to go with Unicode only
(and convert anything that comes in or goes out). I can only imagine
the problems we would have had if we hadn't.

-- 
\\//
peter - http://www.softwolves.pp.se/

  I do not read or respond to mail with HTML attachments.
  Statement concerning unsolicited e-mail according to Swedish law:
  http://www.softwolves.pp.se/peter/reklampost.html


