Bug#99933: Bug#99324: Default charset should be UTF-8

To: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>
Cc: 99933@bugs.debian.org
Subject: Bug#99933: Bug#99324: Default charset should be UTF-8
From: Raul Miller <moth@debian.org>
Date: Mon, 11 Jun 2001 12:34:40 -0400
Message-id: <992273294.21bafc15@debian.org>
Reply-to: Raul Miller <moth@debian.org>, 99933@bugs.debian.org
In-reply-to: <20010611164718.A25953@melkor.dnp.fmph.uniba.sk>; from garabik@melkor.dnp.fmph.uniba.sk on Mon, Jun 11, 2001 at 04:47:18PM +0200
References: <20010611104113.A15114@melkor.dnp.fmph.uniba.sk> <20010611090721.A12776@usatoday.com> <20010611164718.A25953@melkor.dnp.fmph.uniba.sk>

On Mon, Jun 11, 2001 at 04:47:18PM +0200, Radovan Garabik wrote:
> my proposal is #99933

Thanks.

> does JIS X0208 allow chinese characters to be used together with
> japanese?

I don't think so.

However, JIS X0208 implies a Japanese character set and the
Japanese language, while Unicode indicates no such thing.
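
To illustrate that distinction, here is a minimal Python sketch. The choice
of U+76F4 and of the `euc_jp` and `gb2312` codecs is an assumption made for
illustration only, not something taken from the proposal. A legacy encoding
names the national character set it draws on; the Unicode code point carries
no such tag.

```python
# Illustration only: U+76F4 is one of the unified Han characters.
ch = "\u76f4"

# Legacy encodings: the codec name itself identifies a national character set.
as_japanese = ch.encode("euc_jp")   # EUC-JP wraps JIS X 0208
as_chinese = ch.encode("gb2312")    # GB 2312 is a simplified Chinese set

# Decoding either byte string yields the same single code point,
# with nothing left over to say which language the text was in.
from_japanese = as_japanese.decode("euc_jp")
from_chinese = as_chinese.decode("gb2312")

print(as_japanese, as_chinese)          # the legacy byte sequences
print(from_japanese == from_chinese)    # same code point after decoding
print(f"U+{ord(from_japanese):04X}")
```

Any language or glyph preference therefore has to be carried outside the
character data (markup, locale, or font selection).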

> The situation is IMHO quite similar to German using the Fraktur
> (Sütterlin) script - it is a Latin script, and the Unicode Consortium
> (IMHO rightfully) decided that it is a typesetting difference - not an
> encoding one (you can - and sometimes you do - typeset English text
> using Fraktur fonts, after all). If Germans were still using it today,
> you would have exactly the same problems as with CJK scripts now (of
> course, the complexity of CJK is much greater than that of a Latin
> script).

I disagree.  The Han Unification issue is more like the difference
between the Latin and the italic character sets.  Yes, many characters
are similar; however, there are also some characters which are unique to
each representation.

Also, Unicode does include Fraktur characters.

> I am really not sure if Unicode went the right way; I feel the ability
> to display a Chinese name in a Japanese document using Chinese glyphs
> (or vice versa) is something that should not be gotten rid of...

And, this could be rectified -- with Unicode 3.1, they have the code
space to represent each major representation of the character set.

> perhaps it should consider them to be different scripts with different
> encodings, but  when would it stop? Making italics, boldface etc. to be
> different characters?

Unicode already does that.  Take a look at the mathematical alphanumeric
symbols [1D400-1D7FF].  For example:
1D400 MATHEMATICAL BOLD CAPITAL A
1D41A MATHEMATICAL BOLD SMALL A
1D434 MATHEMATICAL ITALIC CAPITAL A
1D44E MATHEMATICAL ITALIC SMALL A
1D468 MATHEMATICAL BOLD ITALIC CAPITAL A
1D482 MATHEMATICAL BOLD ITALIC SMALL A
1D49C MATHEMATICAL SCRIPT CAPITAL A
1D4B6 MATHEMATICAL SCRIPT SMALL A
1D4D0 MATHEMATICAL BOLD SCRIPT CAPITAL A
1D4EA MATHEMATICAL BOLD SCRIPT SMALL A
1D504 MATHEMATICAL FRAKTUR CAPITAL A
1D51E MATHEMATICAL FRAKTUR SMALL A
1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A
etc. etc.
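
As a quick way to check the names in the list above, here is a short sketch
using Python's standard `unicodedata` module. The selection of code points is
arbitrary, and the result depends on the Unicode version bundled with the
interpreter. It also shows that compatibility normalization (NFKC) folds each
styled letter back to a plain "A".

```python
import unicodedata

# A few code points from the list above (Mathematical Alphanumeric Symbols).
code_points = [0x1D400, 0x1D434, 0x1D504, 0x1D538]

for cp in code_points:
    ch = chr(cp)
    name = unicodedata.name(ch)
    # NFKC applies the compatibility mapping, which drops the font styling.
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{cp:05X} {name} -> {folded}")
```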

> > [4] (may just be my ignorance) I don't know if Debian has a full set of
> > Unicode fonts to properly represent text in the various major oriental
> > languages.
> 
> You cannot display all of them at the console anyway.

Console vs. X is not a character set issue.  Note that the console has
other limitations (fixed width, unidirectional).

> As for X11, fonts are being rapidly developed.

For current policy, what matters is what actually works.

> > > Well, there is one issue I thought of... package can include
> > > documentation in different encodings (such as README.koi8,
> > > README.ascii, README.alt). This should be allowed. Perhaps the
> > > sentence "Package may (at the discretion of the maintainer) include
> > > documentation files in other encodings, if they are present also in
> > > canonical encoding, and if the encodings used are clearly marked"
> > > should be added to the proposal?
> > 
> > This sounds like a very good idea.
> >
> > For languages and contexts where a specific font is required to properly
> > read Unicode, it would also be a good idea to clearly indicate that font.
> 
> Maybe.. but let's not overcomplicate things :-)
> > 
> > How about:
> > 
> >  "Package may (at the discretion of the maintainer) include
> >   documentation files in other encodings, if they are present also in
> >   canonical encoding, and if the encodings used are clearly marked. 
> >   If a particular font is required, that should be clearly marked."
> 
> You do not know what is a particular font... one of 
> (traditional|simplified)C,J,K, or the full font name?

I'm not sure I understand this question (I don't know enough about
oriental languages and fonts to give a full answer in any event).

> > > I would not comment about the situation in Japan, since I obviously
> > > know nothing about it (although common sense says it is better to have
> > > one encoding instead of several incompatible ones),
> > 
> > That could just as easily be an argument for ascii over latin-1.
> 
> not really, since ascii cannot be used to display the particular language
> (take slovak or russian).

latin-1 doesn't solve this problem so that's a non-issue.
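
A small Python sketch of the point: neither ASCII nor Latin-1 can encode
Slovak or Russian text, while UTF-8 can. The sample words and the helper
name `can_encode` are made up for this illustration.

```python
# Sample words are illustrative: a Slovak word starting with U+013E
# (l with caron) and a Russian word written in Cyrillic.
samples = {"Slovak": "\u013ead", "Russian": "\u043c\u0438\u0440"}


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for language, word in samples.items():
    for codec in ("ascii", "latin_1", "utf_8"):
        print(language, codec, can_encode(word, codec))
```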

> A more appropriate example from history is the war between EBCDIC,
> ASCII and other proprietary encodings... thank god one and only one
> encoding won.

EBCDIC vs. ASCII wasn't about supported languages.

> > I agree that, except for the oriental languages and legacy systems,
> > unicode is just about perfect in its ability to represent scripts in
> > many languages.
> 
> and that is something terribly needed today, with this
> world wired together.

I agree.

However, Unicode is not a mature standard, so we need to be careful in
places where it would cause problems.

Thanks,

-- 
Raul

Follow-Ups:
- Bug#99933: Bug#99324: Default charset should be UTF-8
  - From: Florian Weimer <Florian.Weimer@RUS.Uni-Stuttgart.DE>
- Bug#99933: Bug#99324: Default charset should be UTF-8
  - From: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>

References:
- Bug#99324: Default charset should be UTF-8
  - From: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>
- Bug#99324: Default charset should be UTF-8
  - From: Raul Miller <moth@debian.org>
- Bug#99933: Bug#99324: Default charset should be UTF-8
  - From: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>
