
Re: support for multilingual Packages files?



Alert: this mail is in UTF-8

On Sat, Jul 07, 2001 at 01:17:54AM +0900, Tomohiro KUBOTA wrote:
> Hi,
> 
> At Fri, 6 Jul 2001 16:11:59 +0200,
> Radovan Garabik <garabik@melkor.dnp.fmph.uniba.sk> wrote:
> 
> > But xterm is better and getting better all the time.
> 
> Agreed.  Xterm is the best Unicode software as far as I know.
> It supports doublewidth characters and combining characters.
> There is a patch to enable bidi (Arab/Hebrew) and high planes
> (above U+10000).  However, we are now discussing about Packages
> file, which is used on installation process.  Thus, we definitely
> need UTF-8 support by Linux console if we want to use UTF-8 for
> Packages file.

Not really, we just need to ensure that base packages are readable
on a limited UTF-8 console (512 glyphs at once - this covers almost
all languages except CJK, and Japanese can theoretically get along
using only hiragana and katakana - am I right?). We are not going to solve
problems with Kanji or Han characters during the installation process;
that would require substantial effort anyway.
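
A rough way to check the 512-glyph argument is to count the distinct
characters a set of texts needs. The sketch below is only an illustration:
the function names are made up for this example, the limit is the figure
quoted above, and the kana ranges (U+3041-U+3096 and U+30A1-U+30FA) are
assumed to stand for "hiragana and katakana".

```python
# Hypothetical helper names; the 512 limit is the console figure quoted above.
CONSOLE_GLYPH_LIMIT = 512


def glyphs_needed(texts):
    """Return the set of distinct printable characters used by the strings."""
    return {ch for text in texts for ch in text if ch.isprintable()}


def fits_console(texts, limit=CONSOLE_GLYPH_LIMIT):
    """True if all the texts can be shown with at most `limit` glyphs loaded."""
    return len(glyphs_needed(texts)) <= limit


# Assumption: these two ranges cover the basic hiragana and katakana letters.
kana = "".join(chr(c) for c in range(0x3041, 0x3097)) + \
       "".join(chr(c) for c in range(0x30A1, 0x30FB))

# A run of 600 CJK ideographs, standing in for kanji-heavy text.
ideographs = "".join(chr(c) for c in range(0x4E00, 0x4E00 + 600))

if __name__ == "__main__":
    print(len(glyphs_needed([kana])), fits_console([kana]))
    print(len(glyphs_needed([ideographs])), fits_console([ideographs]))
```

Kana alone fit comfortably under the limit, while even a small slice of
the ideograph range does not.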

> 
> > > No.  Someone may want to use dselect under LANG=de_DE.ISO-8859-1,
> > > LANG=th_TH.TIS620, or LANG=ru_RU.KOI8-R locales.  Then it will
> > > fail to display my name.  It is just like I use dselect under
> > > LANG=ja_JP.eucJP and it fails to display ISO-8859-1 letters.
> > 
> > This is glibc deficiency. Decent i18n would provide suitable
> > transliteration (try this command:
> > filterm - isolatin1-ascii
> > and display some latin1 texts - if something similar
> > would be in glibc, half of the problem would be gone)
> 
> I cannot think about algorithmical transliteration from
> CJK Ideogram to Latin Alphabets.  One Ideogram often has
> several readings (at least in case of Japanese).

But kana is straightforward, isn't it?
OK, you have a point. But when somebody
has a terminal that is not kanji-enabled, he cannot see it anyway,
so the choice is either not to provide kanji at all,
and then he cannot see it, or to provide kanji, and then
he cannot see it either :-)

...
> Like it, LANG=ja_JP.eucJP and dselect will display Latin/Greek/Cyrillic
> alphabets and a part of CJK Ideogram but will fail to display
> Thai/Hebrew/Arab letters (or transliteration may be available).
> LANG=sk_SK.ISO-8859-2 and dselect will display Latin alphabets
> and transliteration of Greek/Cyrillic alphabets and fail to
> display CJK Ideogram.
> 
> I admit that it is nice.  However, before we dream of nice system,
> we should think about error-free system.
> 

This is just a "get along" system, until UTF-8 is universally
accepted and all major issues are solved.

> I think that two fields of Maintainer: and Maintainer-utf8:
> in Packages file can be a solution.
> 

Could be... but I thought about something like this
(taken from my proposal):

Names of maintainers, upstream authors and other data in
packages' descriptions and related debian data files (such as
debian/changelog, debian/copyright,
debian/control), as well as in English language
documentation, should be either transliterated or
transcribed to ASCII, or used in UTF-8 encoding at the
discretion of the maintainer. However, for names
in scripts based on non-Latin alphabets, an ASCII (or suitable
Latin-script) version should be provided along with the original
name.

So the offending line would look similar to this:
(I borrowed someone's name, hope he does not mind :-))

Maintainer: "Антон Зиновиев (Anton Zinoviev)" <email@here>

People with a proper UTF-8 terminal will see all of it
correctly. If they can read Cyrillic, they know the proper name.
If they cannot, they see the Latin transcription, and have
a nice warm feeling that their terminal is able to display Cyrillic.

People with a transliterating terminal see this:
Maintainer: "Anton Zinoviev (Anton Zinoviev)" <email@here>
No harm was done; they have the best they can get without
a Cyrillic terminal.

People with non-transliterating terminals see this:
Maintainer: "????? ???????? (Anton Zinoviev)" <email@here>
Not much harm; they cannot see the original name, but if it
were not there in the first place, they would not see it
anyway, right?
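
The three cases can be reproduced with a small sketch. This is only an
illustration: the transliteration table is made up and covers just the
letters of this one name, and the '?' fallback uses Python's standard
errors="replace" handling rather than anything a real terminal or
iconv() does.

```python
# Hypothetical, deliberately tiny table: only the letters in the example name.
CYRILLIC_TO_LATIN = {
    "А": "A", "З": "Z",
    "в": "v", "е": "e", "и": "i", "н": "n", "о": "o", "т": "t",
}


def transliterate(text):
    """Replace each known Cyrillic letter by its Latin counterpart."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def replace_unknown(text):
    """Show every non-ASCII character as '?', like a terminal that cannot
    transliterate."""
    return text.encode("ascii", errors="replace").decode("ascii")


line = 'Maintainer: "Антон Зиновиев (Anton Zinoviev)" <email@here>'

if __name__ == "__main__":
    print(line)                   # UTF-8 terminal: shown as written
    print(transliterate(line))    # transliterating terminal
    print(replace_unknown(line))  # terminal with no transliteration
```

In all three cases the Latin version in parentheses survives, which is
the point of providing both.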

Well, staying with plain ASCII has another disadvantage:
if I did not know Anton was Bulgarian, I would have a difficult
time trying to find out what his name really is:
Зиновиев, Зиновиевь, Зиновъевъ, Зиновьев, Зіновїев, Зиновјев,
or any combination of these...
(and I am not sure I got it right anyway)
> 
> 
> > Sure, but translated from _what_ ?
> > 
> > We have one original, and translations. Now by translations I do not
> > mean only language translations, but also charset changes. One of the
> > translations can be english.ASCII (or call it whatever you want).
> 
> Right.  The requirement for the original message is that it must be
> easily read by people all over the world.  Thus, English language is
> the most proper candidate (for some historical/political reasons.  Not
> Esperanto).  And more, I think usage of non-Latin letters (like
> Cyrillic, Greek, CJK Ideogram, Thai, Hebrew, Arab, and so on) should
> be avoided unless very strong reason to use them.  This is because
> Latin Alphabet is the only scripts which we can expect people in the

And this is another point: what we call the Latin alphabet is not just ASCII,
it is also ISO-LATIN-* with diacritics. These letters do not pose
problems to those already familiar with the English (not Latin) alphabet.

(btw the term "latin" is misleading since the original Latin alphabet has no
J,U,W,Y,Z letters)

> world to be able to manage to read it.  Though writing developers'
> names in their native letters sounds fantastic, it implies a risk
> that it cannot be read by people in the world.

There are two risks: technical and linguistic.
The technical risk is that software would not be able to display the name.
This should be solved by fixing the software, not by mangling names
(heck, we now have computers 100x faster and with 100x more RAM than
the first hardware Linux ran on, and the situation would be really sad
if we could not write software able to display 100x more characters!)
All accented letters belong to this category.

The linguistic risk is that very few people are able to read additional
scripts, like Cyrillic or kanji. In this case, a Latin version should be
included as well.

> 
> 
> > Original database should contain all the information (why constrain
> > yourself?), and messages translated to ASCII user's locale will be
> > transliterated to ascii - preferrably on the fly, via improved iconv().
> 
> Original database should be written in English.  Ok, someone might want

Original should be original. English should be used, right, but I do not
mind having foreign-language descriptions for packages that are not
useful for people not speaking that particular language (preferably
together with an English translation, of course).

> to use non-Latin alphabets in very limited cases.  In such a case,
> the writer of the database will have to choose one of two risks:
> (1) use non-Latin alphabets but people in the world may not be able
> to read them or algorithmic transcription may fail, or (2) give up
> to use non-Latin alphabets.

The most common case is names. It is up to the maintainer's common sense
to decide. For a Latin script with diacritics, he can use just the ASCII
version, or the correct version and take the risk, or include both.
For non-Latin scripts, he can include both or just the latinized
version.

> 
> I imagine every writers want their messages to be read by as possible
> as many people in the world.  Thus, I write my signature in Latin
> alphabet.  Debian developers who use ISO-8859-1 letters in their
> Maintainer: field and who don't discuss i18n support are, I imagine,
> simply ignorant about this problem.  I don't think they have a certain
> will to use non-ASCII letters and to improve i18n around Debian.

Agreed. Though many do have the will, they are just unaware
of the problems they are causing.


でわまた 

-- 
 -----------------------------------------------------------
| Radovan Garabik http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__    garabik @ melkor.dnp.fmph.uniba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


