Bug#99933: Bug#99324: Default charset should be UTF-8

To: Raul Miller <moth@debian.org>, 99933@bugs.debian.org
Subject: Bug#99933: Bug#99324: Default charset should be UTF-8
From: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>
Date: Mon, 11 Jun 2001 16:47:18 +0200
Message-id: <20010611164718.A25953@melkor.dnp.fmph.uniba.sk>
Reply-to: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>, 99933@bugs.debian.org
In-reply-to: <20010611090721.A12776@usatoday.com>; from moth@debian.org on Mon, Jun 11, 2001 at 09:07:21AM -0400
References: <20010611104113.A15114@melkor.dnp.fmph.uniba.sk> <20010611090721.A12776@usatoday.com>

On Mon, Jun 11, 2001 at 09:07:21AM -0400, Raul Miller wrote:
> On Mon, Jun 11, 2001 at 10:41:13AM +0200, Radovan Garabik wrote:
> > Please read the proposal carefully (especially Marco and Junichi).
> > Writing (converting into) documents in UTF-8 is "should"
> 
> I'm quite aware of that.  [There's also a "should" on using a single
> character set within a package.]
> 
> Unfortunately, I can't find a copy of the proposal right now.  I searched
> my debian-policy mail box for recent policy that might be it, and I
> searched the entire debian bugs archive for 99324.  I know I saw this
> most recent policy, but I can't find it.  Do you know where it is?

Mea culpa...
By mistake I sent my previous mail to 99324@bugs.debian.org, which is
Cesar Eduardo Barros's proposal about full Unicode support,
while my proposal is #99933.

> 
> Anyways, the point of my earlier questions was whether a better policy
> could be written.  I agree that the current policy proposal won't
> introduce any policy bugs, but that's not the only issue.
> 
> > (and IMHO should be "must" but debian maintainers seem not to be
> > ready for that yet)
> 
> It's not just that debian maintainers are not ready.  Unicode isn't
> ready for oriental languages:
> 
> [1] Unicode is missing the code points to distinguish between
> different languages.  [In the case of Japanese, for instance,
> JIS X0208 is also flawed, but defines more Japanese characters

Does JIS X0208 allow Chinese characters to be used together
with Japanese ones?

> than Unicode does.]  This means that only in combination with
> a particular set of unicode fonts does unicode work right.  [See
> http://www.debian.or.jp/~kubota/unicode-unihan.html for a description
> of the problems of one unicode font in the context of Japanese.]
> 

I am aware of the Unicode problems with CJK.
There are actually two of them.
The first one, which has been emphasised and used as the main argument
against Unicode, is in fact the less important:
Unicode is not complete for CJK. That is (relatively :-)) easy
to fix: just write a proposal and get it accepted...

The second problem reflects a fundamental design decision:
Unicode unifies Chinese (traditional and simplified), Korean and
Japanese characters, and because of differences in glyphs,
an appropriate font is required to view the text
properly.
The situation is IMHO quite similar to German set in Fraktur
(Sütterlin) script - it is a Latin script, and the Unicode
Consortium (IMHO rightfully) decided that this is a typesetting
difference, not an encoding one (you can - and sometimes you do -
typeset English text in Fraktur fonts, after all). If Germans
were still using it today, you would have exactly the same problems
as with the CJK scripts now (of course, the complexity of CJK is
much greater than that of a Latin script).

Or, a similar example: I was reading a linguistics book in Russian,
and there were examples from Old Church Slavonic. To distinguish them
from the normal text, they were typeset in a different font, using actual
ancient glyphs - again, according to Unicode this is a typesetting
change, not an encoding one (it is Cyrillic all the way).

I am really not sure whether Unicode went the right way. I feel the ability
to display a Chinese name in a Japanese document using Chinese glyphs
(or vice versa) is something that should not be given up...

Perhaps it should consider them to be different scripts with different
encodings, but where would it stop? Making italics, boldface etc.
different characters?
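
To make the unification point concrete, here is a minimal Python sketch.
It is only an illustration, not part of the proposal; the choice of
U+76F4 as the example ideograph and the helper name describe() are
assumptions made for this sketch. It shows that the code point and the
UTF-8 bytes are the same whether the character comes from a Chinese or a
Japanese document, so the language (and hence the glyph) has to be
carried outside the encoding.

```python
import unicodedata

# U+76F4 is just an illustrative unified ideograph (assumption of this sketch).
ch = "\u76f4"

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), c.encode("utf-8"))
            for c in text]

# The encoded text is the same whichever language the document is in,
# so the language has to be carried separately, e.g. as a (lang, text) pair.
chinese = ("zh", ch)
japanese = ("ja", ch)

assert describe(chinese[1]) == describe(japanese[1])
print(describe(ch))
```

Which glyph the reader actually sees is then decided by the font chosen
for "zh" or "ja", not by anything in the bytes.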

> [2] Unicode is (was?) missing code points to express all the characters
> used within the language (this is particularly significant with names --
> I don't know if there are other contexts where this is important).
> 
> [3] (less important for debian documentation) Unicode is not 8 bit clean.
> [In the case of Japanese, for instance, EUC-JP is a 7-bit instance of
> JIS X0208 which supports the common subset of the JIS X0208 characters.]
> 
> [4] (may just be my ignorance) I don't know if Debian has a full set of
> Unicode fonts to properly represent text in the various major oriental
> languages.

You cannot display all of them on the console anyway.
That is for the future.
As for X11, fonts are being developed rapidly.

> 
> > only for debian control files and English language documentation
> > (if any non-english characters occur there).
> > For documentation in other languages, it is merely an encouragement.
> 
> I don't remember anything in that policy that made specific reference
> to any language.

It was there, at the end.

> 
> > Well, there is one issue I thought of... package can include
> > documentation in different encodings (such as README.koi8,
> > README.ascii, README.alt). This should be allowed. Perhaps the
> > sentence "Package may (at the discretion of the maintainer) include
> > documentation files in other encodings, if they are present also in
> > canonical encoding, and if the encodings used are clearly marked"
> > should be added to the proposal?
> 
> This sounds like a very good idea.
> 
> For languages and contexts where a specific font is required to properly
> read Unicode, it would also be a good idea to clearly indicate that font.
> 

Maybe... but let's not overcomplicate things :-)

> 
> How about:
> 
>  "Package may (at the discretion of the maintainer) include
>   documentation files in other encodings, if they are present also in
>   canonical encoding, and if the encodings used are clearly marked. 
>   If a particular font is required, that should be clearly marked."

It is not clear what "a particular font" would mean here... one of
(traditional|simplified) C, J, K, or the full font name?

> 
> Jürgen A. Erhard <juergen.erhard@gmx.net> wrote:
> > >     >*Addition to 13.5 Preferred documentation formats:
> > >     >
> > >     >HTML documents, if in encoding other than us-ascii, must have
> > >     >in their header an appropriate META tag describing the used
> > >     >encoding.
> > >
> > > Shouldn't that be "iso-8859-1 (latin1)" instead of "us-ascii"?  As,
> > > IIRC, that is the official default encoding for HTML (according to
> > > RFC2854/RFC2616).
> > 
> > It used to be, in HTML 2.0, but HTML 4.0 says it is ISO 10646 (but
> > does not tell if UTF-8 or UTF-16 or even directly UCS-4)
> 
> Note that Unicode 3.1 defines a mapping between UTF-8/UTF-16 and UTF-32,
> and that UTF-32 is essentially just UCS-4 with Unicode semantics (21
> significant bits of characters).  
> 
> Note, however, that there's a problem with X where font metrics on ISO
> 10646 fonts occupy at least a megabyte in the application (because of
> the low-level data structure currently used in xlib for font metrics).
> 
> > I would not comment about the situation in Japan, since I obviously
> > know nothing about it (although common sense says it is better to have
> > one encoding instead of several incompatible ones),
> 
> That could just as easily be an argument for ascii over latin-1.
> 

Not really, since ASCII cannot be used to write the language in question
(take Slovak or Russian). A more appropriate example from history
is the war between EBCDIC, ASCII and other proprietary encodings...
thank god one and only one encoding won. The situation is repeating itself:
we have 2 competing encodings for Slovak, 3 for Russian... and if
we want one of them to win, why not make the winner Unicode, which has the
indisputable[1] advantage of being unified for the whole world?

[1] of course, the problems with CJK remain and have to be addressed
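
For anyone who has not had to live with competing encodings, a small
Python sketch of what that looks like in practice. The codec lists are
my assumption of which encodings are meant (KOI8-R, CP1251 and
ISO-8859-5 for Russian; ISO-8859-2 and CP1250 for Slovak), and the
sample strings and the helper encode_all() are made up for the
illustration.

```python
# Sample text: a Russian word and a Slovak letter, written as escapes
# so the example does not depend on the encoding of this mail.
russian = "\u043c\u0438\u0440"   # Cyrillic "mir"
slovak = "\u017e"                # z with caron

# Assumed sets of competing legacy encodings (Python codec names).
RUSSIAN_CODECS = ["koi8_r", "cp1251", "iso8859_5"]
SLOVAK_CODECS = ["iso8859_2", "cp1250"]

def encode_all(text, codecs):
    """Encode the same text with every codec and return {codec: bytes}."""
    return {name: text.encode(name) for name in codecs}

ru = encode_all(russian, RUSSIAN_CODECS)
sk = encode_all(slovak, SLOVAK_CODECS)

# Every legacy encoding produces different bytes for the same text.
assert len(set(ru.values())) == len(RUSSIAN_CODECS)
assert len(set(sk.values())) == len(SLOVAK_CODECS)

# Reading KOI8-R bytes as CP1251 does not fail, it silently gives garbage.
garbled = ru["koi8_r"].decode("cp1251")
assert garbled != russian

# With UTF-8 both languages share one encoding and round-trip cleanly.
mixed = russian + " " + slovak
assert mixed.encode("utf-8").decode("utf-8") == mixed
```

The nasty part is the silent garbage in the middle: without a clearly
marked encoding the reader has no way to tell which table to use.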

> I think it's better to address the specific faults of unicode.
> 
> However: I also do not know much about the situation in Japan.  Nor do
> I know about the situation in Korea.  Nor do I know about the situations
> in China.  To properly address this issue, we need the advice of people
> who are familiar with each area -- even though some may not speak english,
> and may not subscribe to debian policy.
> 
> > but I can comment on the situation for Slovak and Russian, and
> > believe me, being able to use unicode would be a godsend.
> 
> I agree that, except for the oriental languages and legacy systems,
> unicode is just about perfect in its ability to represent scripts in
> many languages.
> 

And that is something badly needed today, with the whole
world wired together.

-- 
 -----------------------------------------------------------
| Radovan Garabik http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__    garabik @ melkor.dnp.fmph.uniba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
