Bug#99933: Comments on Unicode

To: "Raul Miller" <moth@debian.org>, <99933@bugs.debian.org>
Subject: Bug#99933: Comments on Unicode
From: "David Starner" <dstarner98@aasaa.ofe.org>
Date: Fri, 6 Jul 2001 04:36:25 +0100
Message-id: <02a801c105cc$d6343ee0$ae4efea9@dvdeug>
Reply-to: "David Starner" <dstarner98@aasaa.ofe.org>, 99933@bugs.debian.org
References: <010201c10517$164398c0$ae4efea9@dvdeug> <20010705133736.C12776@usatoday.com>

Raul Miller <moth@debian.org> wrote:
> On Thu, Jul 05, 2001 at 06:55:24AM +0100, David Starner wrote:
> > I don't know where you got this impression, but it's wrong. Read the
> > document. It introduces a TAG START character, ASCII-equivalent tag
> > characters, and a TAG CANCEL character. <EN-US>You can label text like
> > this.<DE-DE>Ja, du kannst.<TAG CANCEL>
>
> Except that you're not supposed to use this mechanism with HTML, and
> unlike XML, in HTML the language can only be identified in the mime
> header.

That's an HTML problem. Does Debian use enough mixed-language HTML for that
to actually be a problem? If it does, it's not a problem XHTML has.
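
For reference, here is a minimal sketch of the tagging scheme quoted above. It assumes the Plane 14 tag characters as Unicode 3.1 defines them: U+E0001 LANGUAGE TAG (the "TAG START" above), tag characters that mirror printable ASCII at an offset of 0xE0000, and U+E007F CANCEL TAG. The function names are made up for illustration.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG ("TAG START" above)
CANCEL_TAG = "\U000E007F"    # U+E007F CANCEL TAG
TAG_BASE = 0xE0000           # tag characters mirror ASCII 0x20..0x7E at this offset


def language_tag(lang):
    """Return the invisible tag sequence for an ASCII language identifier."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language identifiers must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def strip_tags(text):
    """Drop every Plane 14 tag character, leaving only the visible text."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


# The example from the quoted message, with real tag characters.
tagged = (
    language_tag("en-US") + "You can label text like this."
    + language_tag("de-DE") + "Ja, du kannst."
    + CANCEL_TAG
)
```

A program that doesn't care about language can run strip_tags() and get the plain text back unchanged.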

> Do you have any idea whether the problems identified at
> http://support.microsoft.com/support/kb/articles/Q170/5/59.ASP
> have been resolved?

Are they a problem for us? Windows Code Page 932 may or may not correspond
to anything that we care about. (At a glance, in each pair of code points
that correspond to the same Unicode character, at least one is not in the
real JIS X 0208.) The problems have not been resolved; they are inherent in
the way Unicode was designed. Needless to say, not all the choices made for
Unicode were the same as those made for CP932, and that manifests in the fact
that characters do not always correspond one to one between the two standards.
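
The mismatch can be checked mechanically. The sketch below assumes Python's built-in "cp932" codec follows Microsoft's mapping table closely enough to show the effect; it scans the double-byte range and collects every Unicode character that more than one CP932 byte sequence decodes to. The function name is made up for illustration.

```python
from collections import defaultdict


def cp932_duplicates():
    """Map each Unicode character to the CP932 double-byte sequences that
    decode to it, keeping only characters reachable from two or more."""
    by_char = defaultdict(list)
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                ch = raw.decode("cp932")
            except UnicodeDecodeError:
                continue  # not a valid CP932 sequence
            # A result of two characters means the lead byte was really a
            # single-byte character, so this is not a double-byte code.
            if len(ch) == 1:
                by_char[ch].append(raw)
    return {ch: seqs for ch, seqs in by_char.items() if len(seqs) > 1}


duplicates = cp932_duplicates()
for ch, seqs in sorted(duplicates.items()):
    print("U+%04X <- %s" % (ord(ch), ", ".join(s.hex() for s in seqs)))
```

Every character listed loses information on a CP932 to Unicode to CP932 round trip, because the encoder can only pick one of the byte sequences.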

> Prior to Unicode 3.1 the code space was 16 bits.

NO. Since Unicode 2.0, the code space has been 21 bits. The ONLY thing that
Unicode 3.1 did was assign characters above U+FFFF. It did not change the
fundamental structure of Unicode in the least.
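
The arithmetic behind the 21-bit figure is short. The sketch below assumes the standard UTF-16 surrogate scheme, which is how 16-bit code units reach code points above U+FFFF; the helper name is made up for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space


def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("only code points above U+FFFF use surrogates")
    v = cp - 0x10000               # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


# The largest code point needs exactly 21 bits.
bits_needed = MAX_CODE_POINT.bit_length()

# Cross-check the helper against Python's own UTF-16 encoder.
high, low = to_surrogates(0x1D11E)
encoded = chr(0x1D11E).encode("utf-16-be")
```

Two 10-bit halves give 2^20 supplementary code points on top of the original 2^16, which is where the U+10FFFF ceiling comes from.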

> In principle, at least, with the additional code space unicode can have a
> 1-to-1 mapping with the characters represented in the shift jis standards.

Unicode has a one-to-one mapping with the characters in JIS X 0208, the
basis for all Unix Japanese encodings. That it fails to completely encode
some proprietary encodings is inevitable.

> Once unicode can act as a super set for every character set we currently
> support, we can use it as such.  Until then, we can't.

If Unicode were a superset of every character set that anyone needs to
support, it would be worthless and completely unusable. The creators also
realized that a perfect proposal, ignoring backward compatibility, would go
nowhere. Unicode is a carefully balanced compromise between those two
problems. However, if we currently support any character set well, it is
through a Unicode-based glibc - I don't believe libc accepts the existence
of any character set that can't be mapped to Unicode. So arguably, yes,
Unicode is a superset of every character set we currently support well.

--
David Starner - dstarner98@aasaa.ofe.org
