
Re: UTF-8, CJK and file size



On Wed, 11 Jul 2001, Tomohiro KUBOTA wrote:

> > For the Asian CJK characters, however, UTF-8 typically uses 3 bytes per
> > character.  This is in contrast to the current national encodings
> > which use 2 bytes per character.  The files will, therefore, become
> > half again as big in size, and consequently their transmission over
> > the internet will take half again as long.

> This is true.  (For Russian, Greek, and Thai text the size at least
> doubles, a larger increase than for CJK.)

> > Being sensitive to potential accusations of Western imperialism, my
> > question is whether this file size increase is something that Asian
> > computer users have strong feelings about?  Will it be a major
> > stumbling block hindering the acceptance of the UTF-8 encoding?  Or is it
> > a non-issue?
>
> IMO, this "file size" problem is relatively minor compared with
> the many other problems of Unicode and UTF-8,

Yes, the 'file size' problem is not a non-issue in general, but for our
specific potential uses of Unicode it is probably not much of a problem.
Maintainer names account for a small percentage of the size of the Packages
file, which is also available gzipped, and language-specific Packages files
(with localized package descriptions) can be gzipped as well.  It's a good
point that, with gzip compression, the resulting file sizes for different
encodings of a single language should be comparable.
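
As a rough illustration of the size argument above, here is a minimal Python
sketch that compares the byte count of the same text in a legacy national
encoding and in UTF-8, before and after gzip.  The sample strings, the choice
of legacy codecs (EUC-JP, KOI8-R, TIS-620) and the helper names are
assumptions made for the example; they do not come from the Packages file.

```python
import gzip

# Hypothetical sample strings and legacy codecs, chosen only for illustration.
SAMPLES = {
    "Japanese": ("日本語のテキスト", "euc_jp"),
    "Russian": ("привет", "koi8_r"),
    "Thai": ("สวัสดี", "tis_620"),
}


def sizes(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes) for the same text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


def gzipped_sizes(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes) after gzip compression."""
    legacy = gzip.compress(text.encode(legacy_codec))
    utf8 = gzip.compress(text.encode("utf-8"))
    return len(legacy), len(utf8)


if __name__ == "__main__":
    for name, (text, codec) in SAMPLES.items():
        legacy, utf8 = sizes(text, codec)
        print(f"{name}: {codec} = {legacy} bytes, UTF-8 = {utf8} bytes")
```

The sample strings are far too short for gzip to show anything useful (the
gzip header alone outweighs them), so to check the claim about compressed
sizes, call gzipped_sizes() on the contents of a real localized file rather
than on these samples.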

> like CJK Han Unification (Japanese people generally think that
> Unicode unified not only very similar characters from CJK
> countries but also characters with significant differences),
> the lack of a unified Unicode <-> JIS X 0208 conversion table (see
> http://www.debian.or.jp/~kubota/unicode-symbols.html .
> This will never be solved because of a political problem:
> confrontation among vendors like MS and IBM), "EastAsianWidth"
> bugs, which are strongly related to the above political problem
> (see the above web page), and so on.

> Even so, I admit UTF-8 is useful and will be one of the most
> important encodings in the future.  (This is why I publicize the
> problems with Unicode and UTF-8: using Unicode without knowing
> its problems is foolish.)

I hold the optimistic view that these are simply bugs in Unicode that will be
worked out over time.  There is much to be gained by making sure Unicode meets
the needs of *everyone* in the international community, not just westerners.
If we have to continue using other encodings besides Unicode because it can't
represent some languages properly, then Unicode fails all of us, not just
Japanese speakers.  In particular, it fails software developers, because it
means we'll forever have to support numerous different character sets instead
of the single universal character set Unicode promised us.

One way or another, universal adoption of Unicode is years away.  In the
meantime, starting to use UTF-8 for Debian in appropriate ways is better than
being stuck with ASCII-only, IMHO.

Steve Langasek
postmodern programmer


