Re: Multibyte encoding - what should a package provide?
kubota> Please note, Unicode is not popular at all in Asia. I am sure
kubota> there are very very few people using Unicode in Japan. Instead,
kubota> EUC-JP is popular for UNIX and SHIFT-JIS is the OS's coding
kubota> system for Windows/Macintosh in Japan. I guess EUC-KR is popular
kubota> in Korea (Am I right? -- I guessed from http://www.debian.org/index.ko.html).
i think it might help if the reasons for not liking unicode were
spelled out.
would anyone care to take this up? even a reference specifying the
reasons in an asian language would be good for starters (someone can
translate it :-) ).
i've spent some time looking at this issue recently and i'm still not
certain of the reasons for the dislike. here is my current
understanding. please correct any mistakes.
-there appear to be quite a few people who confuse unicode w/
iso 10646 (i certainly didn't know the difference until i looked into
it :-) ) -- if i understand correctly, unicode is pretty much
a subset of iso 10646 -- it's basically ucs2, a fixed-width
character set. (there are different versions of unicode, so i presume
one needs to be careful around phrases such as 'supports unicode')
-ucs2 uses 16 bits -- that translates into at most 65,536 (2^16)
characters. this is not enough characters to cover all asian languages.
perhaps some actual numbers would be convincing :-) assuming this is
true, i think i see a reason for disliking unicode -- it doesn't appear
to be enough for everybody.
-there is at least one other part of iso 10646 called ucs4 -- it uses
31 (yes, 31) bits per character. this provides 2^31, or a little over
2 billion, characters -- has anyone heard of this being regarded as
not enough?
-last i heard, the only part of ucs4 that has actually been assigned is
essentially what is already in ucs2/unicode (perhaps old info).
combined w/ the points above, this means iso 10646 is not necessarily
the answer (yet?) for what i gather is a fair number of folks who must
deal w/ asian locales -- there might be enough slots for characters,
but if the ones you need aren't there yet, it's not very helpful.
perhaps it's a matter of time?
-the current approach in unicode and iso 10646 is to treat certain
characters (appearance - glyphs?) from different languages as the same
character (byte representation - code point?). supposedly,
'similar-looking enough' (for some definition) characters are treated
as the same character.
the most often cited example i hear of is for kanji (roughly,
ideographs) -- some kanji from different locales are treated as
identical. however, this is also true of characters used in
european languages. you can't tell an italian 'a' apart from an
english 'a' just by looking at the individual characters. the approach
appears to be at least consistent in this fashion.
you might not care much in the case of 'a' because an italian
'a' looks the same (last i checked) as an english 'a'. but in the case
of kanji, i believe this doesn't necessarily hold. it's hard to
give a concrete example in ascii text -- perhaps someone who is
more familiar w/ the issues (or artistic?) can put up some .png images
somewhere to illustrate this point.
i think this has consequences for trying to display documents that
contain characters from multiple languages -- for each character, how
do you decide which font to use if the character can come from several
different languages (and looks different depending on the language)?
you can probably come up w/ elaborate systems to deal w/ this, but it
is not a simple matter of choosing a font based on only looking at
each individual character.
note that for european languages, it appears that no extra processing
would be necessary for display because the characters look similar
enough.
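
a rough sketch of the font problem, assuming python 3 and its standard
unicodedata module. U+76F4 is one commonly cited unified ideograph
whose preferred shape differs between japanese and chinese typography.
the font names and the pick_font helper are made up for illustration;
the point is only that the language has to come from outside the text.

```python
import unicodedata

# U+76F4 is a unified ideograph used in both japanese and chinese text.
ch = "\u76f4"

# the character database gives only a generic name; nothing in the code
# point itself says which language the character came from.
print(unicodedata.name(ch))   # CJK UNIFIED IDEOGRAPH-76F4
print(unicodedata.name("a"))  # LATIN SMALL LETTER A

# hypothetical font table, keyed by a language tag supplied from outside
# the text (markup, locale, user preference, ...).
FONTS = {"ja": "some-japanese-font", "zh": "some-chinese-font"}

def pick_font(char, lang, default="some-latin-font"):
    """choose a font for one character; only ideographs need the tag."""
    if unicodedata.name(char, "").startswith("CJK UNIFIED IDEOGRAPH"):
        return FONTS[lang]
    return default
```

pick_font(ch, "ja") and pick_font(ch, "zh") give different fonts for the
very same code point, while pick_font("a", ...) ignores the tag.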
it might help for someone who knows better to also explain that asian
languages are basically getting no (or very little) backward
compatibility w/ existing encoding methods. (e.g. for japanese, if
you were using euc-jp, iso-2022-jp, or shift-jis (ugh) before, you
basically have to use tables to convert to ucs2/ucs4 -- there are no
'nice' transformations)
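
to make the 'tables, not arithmetic' point a bit more concrete, here is
a small sketch using the codecs that ship w/ python 3 (euc_jp,
shift_jis, iso2022_jp). the sample kanji is an arbitrary choice of
mine, not something from the discussion.

```python
# one kanji, U+6F22, written out in three legacy japanese encodings
# and in utf-8. the sample character is an arbitrary choice.
kanji = "\u6f22"

legacy = {name: kanji.encode(name)
          for name in ("euc_jp", "shift_jis", "iso2022_jp")}

for name, raw in legacy.items():
    print(f"{name:10} {raw.hex()}")
print(f"{'utf-8':10} {kanji.encode('utf-8').hex()}")
print(f"code point U+{ord(kanji):04X}")

# every legacy form decodes back to the same code point; the codecs get
# there by table lookup, since no fixed shift or offset maps the legacy
# byte values onto unicode code points in general.
decoded = {name: raw.decode(name) for name, raw in legacy.items()}
```

the euc-jp and shift-jis bytes are related to each other by arithmetic
on the underlying jis code, but neither is related that way to U+6F22.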
perhaps someone who knows better can explain utf8 (a transformation
that can be performed on ucs2, ucs4, and utf16?) and utf16 (a way of
using parts of ucs2 and ucs4 together?).
here is a question for folks in-the-know: is using utf8 on utf16 seen
as not enough to deal w/ asian locales? even once ucs4 becomes more
fully specified?
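
as a partial illustration of what utf8 and utf16 do w/ a code point,
here is a sketch that assumes python 3 and its built-in utf-8 and
utf-16-be codecs. the two sample characters are my own picks: one fits
in 16 bits, the other is a code point above U+FFFF.

```python
def show(ch):
    """return (code point, utf-8 hex, utf-16 big-endian hex) for ch."""
    return (f"U+{ord(ch):04X}",
            ch.encode("utf-8").hex(),
            ch.encode("utf-16-be").hex())

bmp = "\u6f22"         # fits in 16 bits
beyond = "\U00020b9f"  # needs more than 16 bits

print(show(bmp))     # one 16-bit unit in utf-16, three bytes in utf-8
print(show(beyond))  # two 16-bit units (a surrogate pair), four bytes

# utf-16 reaches 17 planes of 65536 code points each.
UTF16_LIMIT = 17 * 2**16
print(2**16, UTF16_LIMIT, 2**31)
```

so utf16 tops out at 1,114,112 code points: far fewer than the 2^31 of
ucs4, but far more than the 65,536 of ucs2.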