Re: support for multilingual Packages files?

To: debian-devel@lists.debian.org
Subject: Re: support for multilingual Packages files?
From: Tomohiro KUBOTA <tkubota@riken.go.jp>
Date: Fri, 03 Aug 2001 15:55:06 +0900
Message-id: <873d7963at.wl@surfchem0.riken.go.jp>
In-reply-to: In your message of "Tue, 31 Jul 2001 11:22:41 +0200" <20010731112241.A10770@melkor.dnp.fmph.uniba.sk>
References: <87u1ztvrsf.wl@surfchem0.riken.go.jp> <Pine.LNX.4.30.0107301332280.13957-100000@tennyson.netexpress.net> <038801c11921$38b0a380$ae4efea9@dvdeug> <20010731112241.A10770@melkor.dnp.fmph.uniba.sk>

Hi,

At Tue, 31 Jul 2001 11:22:41 +0200,
Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk> wrote:

>> However, the fact you need the mixture of languages does not mean that
>> people in the world need it.  Almost people in the world want to use
>> their own language.  Some others need to use other languages.
>
> Ok, let's dump those others. Majority of people needs just one language.
> And by the way, when we are already doing this, let's dump debian.
> Majority of people use RedHat anyway. Or dump the whole linux - 
> Almost all people in the world use windows anyway.

I don't understand your point.  My point was that UTF-8 is better
for multilingual use but is not a cure-all.  Sometimes other encodings
are better.  If we stop supporting non-UTF-8 locales and thereby
force users onto UTF-8, those users will be annoyed.  If you really
want everyone in the world to use UTF-8, please try to fix the
weak points of Unicode.  Did you read my page
http://www.debian.or.jp/~kubota/unicode-symbols.html ?  Did I write
only about the yen-sign problem?

Users are free to keep using the currently supported locales.
Developers must not force them to change their encodings.  What
developers can do is provide better UTF-8 support, so that users
start to consider migrating to UTF-8.

> double-width characters? It should not be UTF-8's problem, UTF-8 is a
> _text_ storage format, not WYSIWYG word processor format.

Do you know Unicode Standard Annex #11, East Asian Width?  I am saying
that it is buggy.  http://www.unicode.org/unicode/reports/tr11/
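
For readers who do not know the annex: it assigns every character a
width class.  A minimal sketch, assuming Python's standard unicodedata
module (not something used in this thread), that looks the class up.
It only shows what the property is, not which part of it is being
criticised here.

```python
import unicodedata

def width_class(ch: str) -> str:
    """Return the East Asian Width class of a single character."""
    return unicodedata.east_asian_width(ch)

# Illustrative sample characters (chosen for this sketch):
#   "a"      ASCII letter
#   U+6F22   a CJK ideograph
#   U+03B1   Greek small alpha
#   U+FF21   fullwidth Latin capital A
samples = ["a", "\u6f22", "\u03b1", "\uff21"]
classes = {ch: width_class(ch) for ch in samples}

for ch, cls in classes.items():
    print(f"U+{ord(ch):04X} {cls}")
```

The class "A" (ambiguous) is the one where the display width depends
on context, so a terminal cannot decide it from the character alone.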

> Round-trip compatibility with existing legacy encodings? Not a problem
> in UTF-8 itself. IMHO unicode could have been much more simpler if they 
> did not try to keep codepoints for all those legacy encodings.

The same can be said of precomposed Latin characters such as "u"
with umlaut.  They were introduced only for compatibility
with legacy encodings such as ISO-8859-*.
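
A small sketch of that point, assuming Python's standard unicodedata
module: the precomposed "u" with umlaut is equivalent to a plain "u"
plus a combining mark, and the single code point corresponds to the
single byte that ISO-8859-1 already used for it.

```python
import unicodedata

precomposed = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# Canonical decomposition: base letter followed by a combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

# The legacy encoding stores the precomposed form in one byte.
legacy_bytes = precomposed.encode("iso8859_1")

# Composing again gives back the original code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed], legacy_bytes, recomposed == precomposed)
```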

> Missing characters? Yes, that IS a problem. So write a proposal to
> unicode consortium. If they refuse it for no good reason, well, 
> then there IS a REAL problem.

When did I talk about missing characters?  I suppose the CJK Han
Unification problem looks similar to missing characters, but it is
really a problem of how the identity of a character is defined.
Moreover, there is no hope that the CJK Han Unification principle will
be changed.  Thus, we have to think about a makeshift way to
distinguish CJK characters, such as variant tags.  But that is not the
topic here.

> But we should not be discussing technical aspects of unicode here,
> this has already been flamed to death elsewhere.

Then do you tentatively agree that some people want to use non-UTF-8
locales?

> someone (I forgot who) else already wrote it: The main reasons why
> Japanese are against unicode is that they already have their own,
> well-working, national character encodings, and do not like the idea
> of changing it to something radically different.
>
> I believe he hit the issue directly :-)

There may or may not be some Japanese people who think so.  However,
that is not the point here.  Your sentence only invites a flame war.

>> Good.  Thus, we need ASCII field for maintainers to write their
>> preferable ASCII name.
> Yes. Or Packages-ascii.

OK.  Then I think we agree that both UTF-8 and ASCII names are needed.
It is technically possible to convert the UTF-8 name into a
non-UTF-8 encoding, provided that every character in the UTF-8 name
can be converted.  Thus, if you write a UTF-8 name that stays within
the ISO-8859-2 character set, this (non-ASCII) name will be used in
ISO-8859-2 locales.
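
A minimal sketch of this fallback rule in Python.  The function
display_name and the sample name are hypothetical and only illustrate
the idea; no existing tool is claimed to work this way.

```python
def display_name(utf8_name: str, ascii_name: str, locale_encoding: str) -> str:
    """Pick the name to show in a locale that uses locale_encoding.

    The UTF-8 name is used only if every one of its characters can be
    converted to the locale's encoding; otherwise the ASCII name is used.
    """
    try:
        utf8_name.encode(locale_encoding)
    except UnicodeEncodeError:
        return ascii_name
    return utf8_name

# Hypothetical maintainer whose name fits in ISO-8859-2.
utf8_name = "Ji\u0159\u00ed Nov\u00e1k"
ascii_name = "Jiri Novak"

for encoding in ("iso8859_2", "iso8859_7", "utf-8"):
    print(encoding, display_name(utf8_name, ascii_name, encoding))
```

In this sketch the ISO-8859-2 and UTF-8 locales get the non-ASCII
name, and the ISO-8859-7 locale falls back to the ASCII name.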

>> For tentative purpose until Maintainer-utf8: field will be available,
>> you can use README.Debian file or so to put your correct name,
>> with -*- coding: foobar; -*- line at the first line.
>
> Or let's say default encoding of README.Debian is utf-8, and you use
> -*- coding: ascii; -*- as the first line for your packages. Well...
> since ascii is a subset of UTF-8, you even need not to use the line :-)

It is the opposite.  A file written in ASCII may carry the encoding
specifier -*- coding: utf-8; -*- on its first line.  But if you write
a file in UTF-8, the specifier must be utf-8.

This has nothing to do with the discussion about the preferred
encoding for README.Debian.  It is just basic usage of the encoding
specifier.  Imagine a file in EUC-JP with -*- coding: ascii; -*- as
its first line, on the grounds that ascii is a subset of EUC-JP (which
is true).  I imagine you understand that this is obviously invalid.
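
The rule can be stated as "the whole file must decode under the
declared encoding".  A minimal sketch in Python; the helper names and
the regular expression are assumptions made for this illustration, not
part of any existing tool.

```python
import re

# Matches a first-line specifier such as: -*- coding: utf-8; -*-
CODING_RE = re.compile(rb"-\*-\s*coding:\s*([-\w.]+)\s*;?\s*-\*-")

def declared_encoding(data: bytes):
    """Return the encoding named on the first line, or None."""
    first_line = data.split(b"\n", 1)[0]
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None

def specifier_is_valid(data: bytes) -> bool:
    """True if the file content really decodes under its specifier."""
    encoding = declared_encoding(data)
    if encoding is None:
        return False
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True

# An ASCII-only file may declare utf-8 ...
ascii_file = b"-*- coding: utf-8; -*-\nplain text\n"
# ... but an EUC-JP file must not declare ascii.
eucjp_body = "\u65e5\u672c\u8a9e\n".encode("euc_jp")
bad_file = b"-*- coding: ascii; -*-\n" + eucjp_body
good_file = b"-*- coding: euc-jp; -*-\n" + eucjp_body

print(specifier_is_valid(ascii_file), specifier_is_valid(bad_file),
      specifier_is_valid(good_file))
```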

> I never told this. I told that under b), maintainers can by themselves
> decide if they write ASCII-only version, UTF-8 only version, 
> or both.

I see.  I am happy to agree that we will have both ASCII and UTF-8
versions of names.  Then let's discuss further.  I think one of them
should be mandatory, because we should have a policy for the encoding
of the Maintainer: field.  I also think the ASCII version is a good
candidate for the mandatory field.  Reasons:

1. I think the need for a UTF-8 field is well understood by the
   maintainers themselves ("My name is not expressed well in ASCII!").
   On the other hand, the need for an ASCII field comes from the
   technical fact that various encodings are used in the world.

2. If the ASCII field is not mandatory, the transliteration method is
   left to software such as dselect.  Some programs may use '?', some
   may use a transliteration library, and so on.  This will surprise
   users, because a maintainer's name will differ between programs
   even under the same locale.

3. Maintainers generally don't test their software in different
   locales.  Similarly, they don't imagine how their names are
   displayed in different locales.  They should know.

4. Current and older versions of dselect and similar tools do not
   handle transliteration.
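
Reason 2 can be illustrated with two plausible fallbacks that
different programs might choose.  This is only a sketch in Python; the
function names and the sample name are hypothetical, and neither
function is claimed to be what dselect or any other tool does.

```python
import unicodedata

def replace_with_question_marks(name: str) -> str:
    """Fallback A: every non-ASCII character becomes '?'."""
    return name.encode("ascii", errors="replace").decode("ascii")

def strip_diacritics(name: str) -> str:
    """Fallback B: decompose, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

# Hypothetical maintainer name.
name = "Ji\u0159\u00ed Nov\u00e1k"

shown_by_a = replace_with_question_marks(name)
shown_by_b = strip_diacritics(name)
print(shown_by_a)
print(shown_by_b)
```

The same maintainer appears under two different names in the same
locale, which is the surprise a mandatory ASCII field would avoid.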

I don't insist on the name of the field, as long as an ASCII version
of the name is supplied systematically.  For example, the Maintainer:
field can be the UTF-8 version, provided that lintian issues an error
when non-ASCII characters are used in the Maintainer: field and no
Maintainer-ascii: field is supplied.
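
A sketch of such a check, written in Python for illustration only.
It is not lintian code; the function check_maintainer and the error
strings are hypothetical, and Maintainer-ascii is the field name
proposed in this thread, not an existing field.

```python
def check_maintainer(fields: dict) -> list:
    """Return a list of error messages for one control stanza.

    fields maps field names to their (already decoded) values.
    """
    errors = []
    maintainer = fields.get("Maintainer", "")
    ascii_name = fields.get("Maintainer-ascii")

    # A non-ASCII Maintainer: needs an accompanying ASCII version.
    if not maintainer.isascii() and ascii_name is None:
        errors.append("non-ascii-maintainer-without-ascii-field")

    # The ASCII version itself must really be ASCII.
    if ascii_name is not None and not ascii_name.isascii():
        errors.append("maintainer-ascii-field-not-ascii")

    return errors

# Hypothetical stanzas.
missing = {"Maintainer": "Ji\u0159\u00ed Nov\u00e1k <jiri@example.org>"}
complete = dict(missing, **{"Maintainer-ascii": "Jiri Novak <jiri@example.org>"})

print(check_maintainer(missing))
print(check_maintainer(complete))
```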

Of course a UTF-8 name can contain not only Latin characters with
diacritics but also any other characters in Unicode, including
ideographs, Thai, Indic, Hebrew, Arabic, Hangul, Hiragana, Katakana,
Cyrillic, Greek, and so on.

> Do not take me wrong, I am well aware of your reasons for ASCII only
> Packages, I just feel a bit more radical :-)

Radicalism sometimes comes from ignorance of minorities.
Unfortunately, the world is not as simple as you think.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/
