Bug#99324: Default charset should be UTF-8

To: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>, 99324@bugs.debian.org
Subject: Bug#99324: Default charset should be UTF-8
From: Raul Miller <moth@debian.org>
Date: Mon, 11 Jun 2001 09:07:21 -0400
Message-id: <20010611090721.A12776@usatoday.com>
Reply-to: Raul Miller <moth@debian.org>, 99324@bugs.debian.org
In-reply-to: <20010611104113.A15114@melkor.dnp.fmph.uniba.sk>; from garabik@melkor.dnp.fmph.uniba.sk on Mon, Jun 11, 2001 at 10:41:13AM +0200
References: <20010611104113.A15114@melkor.dnp.fmph.uniba.sk>

On Mon, Jun 11, 2001 at 10:41:13AM +0200, Radovan Garabik wrote:
> Please read the proposal carefully (especially Marco and Junichi).
> Writing (converting into) documents in UTF-8 is "should"

I'm quite aware of that.  [There's also a "should" on using a single
character set within a package.]

Unfortunately, I can't find a copy of the proposal right now.  I searched
my debian-policy mailbox for recent policy that might be it, and I
searched the entire Debian bugs archive for 99324.  I know I saw this
most recent policy, but I can't find it.  Do you know where it is?

Anyways, the point of my earlier questions was whether a better policy
could be written.  I agree that the current policy proposal won't
introduce any policy bugs, but that's not the only issue.

> (and IMHO should be "must" but debian maintainers seem not to be
> ready for that yet)

It's not just that Debian maintainers are not ready.  Unicode isn't
ready for oriental languages:

[1] Unicode is missing the code points to distinguish between
different languages.  [In the case of Japanese, for instance,
JIS X0208 is also flawed, but defines more Japanese characters
than Unicode does.]  This means that only in combination with
a particular set of unicode fonts does unicode work right.  [See
http://www.debian.or.jp/~kubota/unicode-unihan.html for a description
of the problems of one unicode font in the context of Japanese.]

[2] Unicode is (was?) missing code points to express all the characters
used within the language (this is particularly significant with names --
I don't know if there are other contexts where this is important).

[3] (less important for debian documentation) Unicode is not 8 bit clean.
[In the case of Japanese, for instance, EUC-JP is a 7-bit instance of
JIS X0208 which supports the common subset of the JIS X0208 characters.]

[4] (may just be my ignorance) I don't know if Debian has a full set of
Unicode fonts to properly represent text in the various major oriental
languages.

> only for debian control files and English language documentation
> (if any non-english characters occur there).
> For documentation in other languages, it is merely an encouragement.

I don't remember anything in that policy that made specific reference
to any language.

> Well, there is one issue I thought of... package can include
> documentation in different encodings (such as README.koi8,
> README.ascii, README.alt). This should be allowed. Perhaps the
> sentence "Package may (at the discretion of the maintainer) include
> documentation files in other encodings, if they are present also in
> canonical encoding, and if the encodings used are clearly marked"
> should be added to the proposal?

This sounds like a very good idea.

For languages and contexts where a specific font is required to properly
read Unicode, it would also be a good idea to clearly indicate that font.

How about:

 "Package may (at the discretion of the maintainer) include
  documentation files in other encodings, if they are present also in
  canonical encoding, and if the encodings used are clearly marked. 
  If a particular font is required, that should be clearly marked."
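
To make "present also in canonical encoding" concrete, here is a minimal
sketch in Python of how a legacy-encoded README could be re-encoded and
checked.  It assumes the canonical encoding is UTF-8 and uses KOI8-R as
the example legacy charset; the function names are made up for this
illustration and do not come from the proposal or from any Debian tool.

```python
def to_canonical(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a legacy charset into UTF-8 (assumed canonical)."""
    return raw.decode(source_encoding).encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical contents of a README.koi8 file.
koi8_text = "Привет".encode("koi8_r")
utf8_text = to_canonical(koi8_text, "koi8_r")
```

The same text is kept in both forms: koi8_text stands in for the clearly
marked variant file, and utf8_text for the canonical copy that would sit
next to it.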

Jürgen A. Erhard <juergen.erhard@gmx.net> wrote:
> >     >*Addition to 13.5 Preferred documentation formats:
> >     >
> >     >HTML documents, if in encoding other than us-ascii, must have
> >     >in their header an appropriate META tag describing the used
> >     >encoding.
> >
> > Shouldn't that be "iso-8859-1 (latin1)" instead of "us-ascii"?  As,
> > IIRC, that is the official default encoding for HTML (according to
> > RFC2854/RFC2616).
> 
> It used to be, in HTML 2.0, but HTML 4.0 says it is ISO 10646 (but
> does not tell if UTF-8 or UTF-16 or even directly UCS-4)

Note that Unicode 3.1 defines a mapping between UTF-8/UTF-16 and UTF-32,
and that UTF-32 is essentially just UCS-4 with Unicode semantics (21
significant bits of characters).  
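
As a rough illustration of that mapping, the Python sketch below encodes
one code point in each of the three forms and checks that the largest
Unicode code point fits in 21 bits.  The helper name encodings_of is
invented for this example, and the big-endian codec variants are chosen
only so that no byte order mark is emitted.

```python
def encodings_of(ch: str) -> dict:
    """Return the code point of ch and its UTF-8/16/32 byte sequences."""
    return {
        "code_point": ord(ch),
        "utf-8": ch.encode("utf-8"),
        "utf-16": ch.encode("utf-16-be"),  # big-endian, no byte order mark
        "utf-32": ch.encode("utf-32-be"),  # big-endian, no byte order mark
    }


# The highest code point Unicode defines; it needs exactly 21 bits.
MAX_CODE_POINT = 0x10FFFF

latin = encodings_of("\u00fc")       # U+00FC, u with diaeresis
beyond_bmp = encodings_of("\U00010000")  # first code point past the BMP
```

A character past the 16-bit range takes four bytes in UTF-8, a surrogate
pair in UTF-16, and a single 32-bit unit in UTF-32, but all three decode
back to the same code point.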

Note, however, that there's a problem with X where the font metrics for
ISO 10646 fonts occupy at least a megabyte in the application (because of
the low-level data structure currently used in Xlib for font metrics).

> I would not comment about the situation in Japan, since I obviously
> know nothing about it (although common sense says it is better to have
> one encoding instead of several incompatible ones),

That could just as easily be an argument for ascii over latin-1.

I think it's better to address the specific faults of unicode.

However: I also do not know much about the situation in Japan.  Nor do
I know about the situation in Korea.  Nor do I know about the situations
in China.  To properly address this issue, we need the advice of people
who are familiar with each area, even though some may not speak English,
and may not subscribe to debian-policy.

> but I can comment on the situation for Slovak and Russian, and
> believe me, being able to use unicode would be a godsend.

I agree that, except for the oriental languages and legacy systems,
unicode is just about perfect in its ability to represent scripts in
many languages.

Thanks,

-- 
Raul
