
Re: support for multilingual Packages files?



On Mon, Jul 30, 2001 at 06:58:20PM +0100, David Starner wrote:
> From: Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk>
...
> 
> > That is my problem - I cannot communicate well with ASCII only.
> > Not even with my ISO-8859-2 console I am (sort of) forced to use.
> > It is not just my name - to hell with it. But the inability
> > to mix languages freely when I need it drives me up the wall.
> 
> According to someone on slashdot (Alex Belitis(?)), you don't need to mix
> languages. So what's the problem? (-;

I did not realize :-)
Ok, I'll stay with Slovak only from now on, enjoy my next mails to
debian-devel :-)

> 
> But seriously, I understand your trouble. But I don't see where the Packages
> files is a big part of it. Better UTF-8 handling in X, better console/*term

It is just a small part of it.

> handling of UTF-8, better editor support for UTF-8, all of which would
> probably be more useful than worrying about the Packages file.
> 

Agreed. I am not saying I do not worry about these issues too.


On Mon, Jul 30, 2001 at 01:53:57PM -0500, Steve Langasek wrote:
> On Mon, 30 Jul 2001, Radovan Garabik wrote:
> 
> >>> 2) Localized fields in debian/control, such as Description-fr etc.
> >>> This is a different issue than 1), and has not been much discussed.
> >>> Probably the same way as debconf follows could be adopted.
> >>> Notice that even in English, there is an occasional need for
> >>> diacritics.
> 
> >> How is there a *need* for diacritics?  Most of the English-speaking world
> >> has
> 
> > Maintainer names. What is the discussion mostly about?
> 
> Rephrase.  How is there a need for diacritics *in English*?  This is what you
> asserted in the email I was replying to.  I maintain that English does not
> need diacritics.

Ok. There is a need for diacritics in (maintainers') names, even in
the original English-only Packages file.
(Whether the need is a big or a minor one, well, that can be a matter of
personal opinion.)

> 
> > It should, but it could not. For translation effort not to look dumb,
> > there is a need for _proper_ maintainers names somewhere. I am trying
> > to put it into Packages. If you have other ideas, please tell.
> 
> I concede that it's useful to be able to represent Maintainer names in full
> Unicode; that is not in question.  What I disagree with is the argument that
> such non-ASCII characters should be included in existing fields of the Package
> file.

Well, your arguments do indeed have a point. I am more and more inclined
to the idea of a canonical Packages-utf8 and a separate
Packages-en (== Packages.ascii == Packages).

> > Do not take me wrong, I am well aware of your reasons for ASCII only
> > Packages, I just feel a bit more radical :-)
> 
> We cannot change all the world's software in a day.  Being 'radical' is a
> disservice to our users, who need a system that continues to work between now
> and the day that UTF8 is available everywhere.
> 

Debian users are well known for their ability to adapt the system to their
needs :-)
(otherwise they'd be running Mandrake or similar...)


On Tue, Jul 31, 2001 at 09:57:52AM +0900, Tomohiro KUBOTA wrote:
> Hi,
> 
> Sorry for a long mail...
> 
> At Mon, 30 Jul 2001 18:00:04 +0200,
> Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk> wrote:
> 
> > but national encodings are just that, national encodings. I cannot count
> > how many times I have been frustrated by this, when I needed to mix more
> > languages. UTF-8 helped a lot, at least on www pages I can use it freely
> > (apart from some support for old www browsers, but that is rather easy 
> > achievable when I can expect who is accessing my pages)
> 
> Do you know Mule and Emacs have been able to mix many languages
> for long years?  They are not based on UTF-8.  Ok, UTF-8 is one

Yes. Even M$ Word has been able to mix many languages for many years.
(Yes, I know the difference between what Word is and what Emacs is.)
The point is that it is a _proprietary_ (== one product only) encoding.
Had Mule caught on and become widespread, the situation might have been
different. It did not. UTF-8 did instead.

> of such international encodings.


> 
> However, the fact you need the mixture of languages does not mean that
> people in the world need it.  Almost people in the world want to use
> their own language.  Some others need to use other languages.

Ok, let's dump those others. The majority of people need just one language.
And by the way, while we are at it, let's dump Debian.
The majority of people use RedHat anyway. Or dump the whole of Linux -
almost all people in the world use Windows anyway.

> As I wrote at http://www.debian.or.jp/~kubota/unicode-symbols.html ,
> Unicode has problems on using Japanese and I am very sure that Unicode

Double-width characters? That should not be UTF-8's problem; UTF-8 is a
_text_ storage format, not a WYSIWYG word processor format.
Anyway: do you really think it is Unicode's fault for specifying
single-width Cyrillic characters, as opposed to the legacy Japanese
encodings, which specify them as double width? Who is wrong and
who is correct here?
Round-trip compatibility with existing legacy encodings? Not a problem
in UTF-8 itself. IMHO Unicode could have been much simpler if they
had not tried to keep codepoints for all those legacy encodings.
Missing characters? Yes, that IS a problem. So write a proposal to
the Unicode Consortium. If they refuse it for no good reason, well,
then there IS a REAL problem.
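
To make the width question concrete: Unicode data carries an East Asian
Width property for each character, and Python's standard unicodedata
module exposes it. The sketch below is only an illustration; the sample
characters and the helper name describe_width are my own choices, not
anything taken from the standard or from this thread.

```python
import unicodedata

# Hypothetical helper for illustration: map the East Asian Width
# property codes to a short human-readable label.
LABELS = {
    "W": "wide",
    "F": "fullwidth",
    "Na": "narrow",
    "H": "halfwidth",
    "A": "ambiguous (narrow unless in a legacy East Asian context)",
    "N": "neutral",
}

def describe_width(ch: str) -> str:
    """Return the code point, width code, and label for one character."""
    code = unicodedata.east_asian_width(ch)
    return f"U+{ord(ch):04X} {code}: {LABELS[code]}"

# Latin small a, Cyrillic small zhe, Hiragana a
for ch in ("a", "\u0436", "\u3042"):
    print(describe_width(ch))
```

The "ambiguous" class records the disagreement rather than deciding it:
the terminal or font still has to pick a width.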

But we should not be discussing technical aspects of unicode here,
this has already been flamed to death elsewhere.

> cannot supply a usable solution unless MS or IBM would go bankrupt
> or would resign a membership of Unicode Consortium.
> 
> Thus, UTF-8 locales are needed for people who use language mix,
> while EUC-JP locale is needed for Japanese.  I don't know very well
> about other languages.

Someone else (I forgot who) already wrote it: the main reason why the
Japanese are against Unicode is that they already have their own
well-working national character encodings, and do not like the idea
of changing them to something radically different.

I believe he hit the issue squarely :-)


> 
> 
> >> It is YOU who want to avoid confusion of characters with and without
> >> diacritics.  Why can you say that all people with Latin-script names
> >> want to use question mark than eliminating diacritics?
> > Not all. That would be up to the maintainers to decide what do
> > they want to do with their names. 
> 
> Good.  Thus, we need ASCII field for maintainers to write their
> preferable ASCII name.

Yes. Or Packages-ascii.

> 
> 
> > And? Do you know how to read Slovak letter "ch" ?
> > It consists of two pure-ASCII characters, no diacritics.
> 
> No, but we understand "ch" consists from two characters of "c" and "h".
> Nobody confuses "c" as "o".  However, for characters with diacritics,
> we really don't know them and we can confuse a character as a different
> character.  We may confuse acute, grave, and macron.  We may consider
> the difference of them as a difference between typeface, because we
> don't know these diacritics.  

So, when I write it in ASCII only, you will not confuse it with
the incorrect diacritics-less variant?

> 
> 
> [from other mails by Radovan]
> 
> >> Yes, it should be ASCII.  ASCII is the common denominator that's present in
> > It should, but it could not. For translation effort not to look dumb,
> > there is a need for _proper_ maintainers names somewhere. I am trying
> > to put it into Packages. If you have other ideas, please tell.
> 
> For tentative purpose until Maintainer-utf8: field will be available,
> you can use README.Debian file or so to put your correct name,
> with -*- coding: foobar; -*- line at the first line.

Or let's say the default encoding of README.Debian is UTF-8, and you use
-*- coding: ascii; -*- as the first line for your packages. Well...
since ASCII is a subset of UTF-8, you do not even need the line :-)
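
The subset claim is easy to check. A minimal sketch, assuming nothing
beyond the Python standard library; the names and address below are
made-up placeholders, not real maintainers:

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if no byte has the high bit set."""
    return all(b < 0x80 for b in data)

# Placeholder field values, invented for this example.
ascii_bytes = "Maintainer: John Doe <jdoe@example.org>".encode("ascii")
utf8_bytes = "Maintainer: Jo\u017eko Mrkvi\u010dka".encode("utf-8")

# Pure-ASCII bytes decode to the same text under either codec, so a
# UTF-8 reader needs no special marker for an ASCII-only file.
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

print(is_pure_ascii(ascii_bytes))  # True
print(is_pure_ascii(utf8_bytes))   # False
```

The reverse does not hold: the second value contains bytes above 0x7F,
so an ASCII-only reader would reject or garble it.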

> 
> > I did not say it is not bad. But if Tomohiro sees a random garbage kana or
> > a question mark in my name, I do not think it will be the end of the world.
> > And the same, if someone sees random ISO-8859-2 characters in place of his name,
> > it is not the end of the world.
> 
> Did you read how garbage can be?  I said it may break the whole screen
> (by scrolling).

Yes, I read it. And yes, it is a problem. And yes, I would
be willing to live with it (I know you would not).
Anyway, if the Packages-utf8 and Packages-ascii idea catches on,
this problem will be solved.

> 
> Is it the end of the world that your name is written in ASCII
> characters without diacritics?  

Is it the end of the world that you cannot perfectly convert text from
EUC-JP to UTF-8 and back?
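
Whether a given text survives the trip can at least be tested. The
sketch below uses Python's built-in euc_jp codec; the function name
round_trips is hypothetical, and which characters fail, if any, depends
on the conversion table in use, so I am not claiming any particular
result for the disputed ones.

```python
def round_trips(data: bytes, legacy: str = "euc_jp") -> bool:
    """True if legacy bytes survive legacy -> UTF-8 -> legacy unchanged."""
    try:
        text = data.decode(legacy)
        utf8 = text.encode("utf-8")
        back = utf8.decode("utf-8").encode(legacy)
    except UnicodeError:
        # Undecodable input or unmappable characters count as a failure.
        return False
    return back == data

# Sample text chosen for illustration only.
sample = "日本語のテキスト".encode("euc_jp")
print(round_trips(sample))
```

Running this over a real corpus would show how often the problem
actually bites with one particular converter.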

> I don't understand why you
> insist this problem is more important than garbage character
> problem.

And you insist that CJK unification and support for legacy characters
in UTF-8 are more important than the single worldwide unified encoding
that Unicode promises.
I guess we are both equally stubborn :-)

> 
> 
> >>       disadvantages:
> >>       - maintainers who want to use non-ASCII characters are forced
> >>         to supply two versions of descriptions (or names,...).  However,
> > or they could decide if they prefer not to include the ASCII version at all,
> > so that nobody is confused by incorrect variant of their name (I am talking
> > now about latin-script names with diacritics)
> 
> In my idea, maintainers are free to include '?' in ASCII field in
> such cases, just I wrote before.  Of course maintainers are free
> to choose 'ue' or 'u' (or even literally '&uuml;' or '\"u' or even
> 'foobar') for &uuml; .  On the other hand, in your idea b) (require
> using utf-8), maintainers cannot control how their own name is
> displayed when their local characters are not available.  Also,

I never said this. I said that under b), maintainers can decide for
themselves whether to write an ASCII-only version, a UTF-8-only version,
or both.
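
For maintainers who supply only the UTF-8 form, an ASCII form could be
derived mechanically, with the maintainer's own preference ('ue', 'u',
or '?') taking priority. A rough sketch, assuming Python's unicodedata
module; the function ascii_fallback and its overrides parameter are
hypothetical and not part of any Debian tool:

```python
import unicodedata
from typing import Dict, Optional

def ascii_fallback(name: str,
                   overrides: Optional[Dict[str, str]] = None) -> str:
    """Derive an ASCII rendering of a name.

    Order of preference: maintainer-supplied override, plain ASCII,
    base letter with diacritics stripped, and finally '?'.
    """
    overrides = overrides or {}
    out = []
    for ch in name:
        if ch in overrides:
            out.append(overrides[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Decompose, then drop the combining marks (the diacritics).
            base = "".join(
                c for c in unicodedata.normalize("NFKD", ch)
                if not unicodedata.combining(c)
            )
            out.append(base if base and base.isascii() else "?")
    return "".join(out)

print(ascii_fallback("M\u00fcller"))                      # Muller
print(ascii_fallback("M\u00fcller", {"\u00fc": "ue"}))    # Mueller
```

Characters with no ASCII base letter, such as kanji, fall through to
'?', which is exactly the case where an explicit ASCII name from the
maintainer is the better answer.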

> even if you choose '?', you can explicitly show your will to use
> '?' by supplying ASCII field.
> 
> I don't understand why you don't like supplying ASCII version
> of your name.  It seems that you just don't want to face up

<irony>
Well, I do not understand why the Japanese do not switch completely to Kana.
This would make CJK unification a non-issue and facilitate the teaching of
Japanese writing to children; kana is perfectly established in
Japanese society and this would tremendously simplify software dealing
with Japanese.
</irony>

> to the reality that not every people in the world cannot read
> your diacritics.

But the technology is there. You do realize how enormously powerful
contemporary computers are. I worked on 8-bit computers and on old JSEP
minicomputers. They were capable of 7-bit ASCII. Then the IBM PC/XT/AT came;
they were capable of 8-bit national codes (Slovak letters, finally).
Compare those computers with the current P3, P4. And we cannot make them
work with all existing scripts in the world? Shame.

-- 
 -----------------------------------------------------------
| Radovan Garabik http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__    garabik @ melkor.dnp.fmph.uniba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


