
Re: UTF-8, CJK and file size



Hi,

At Wed, 11 Jul 2001 00:33:47 +1000,
Drew Parsons <dparsons@emerall.com> wrote:

> For the Asian CJK characters, however, UTF-8 typically uses 3 bytes per
> character.  This is in contrast to the current national encodings
> which use 2 bytes per character.  The files will, therefore, become
> half again as big in size, and consequently their transmission over
> the internet will take half again as long.

This is true.  (For Russian and Greek text the size doubles, and
for Thai it triples, so the increase is larger than for CJK.)
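
The byte counts behind this can be checked with a short sketch.
It assumes Python 3 and its standard codecs; the sample words and
the choice of legacy encodings (EUC-JP, KOI8-R, ISO-8859-7,
TIS-620) are only illustrative assumptions, and real documents
that mix in ASCII will show smaller ratios.

```python
# Compare the encoded size of short sample strings in a legacy
# national encoding and in UTF-8. The sample strings and the choice
# of legacy codecs are illustrative assumptions.
SAMPLES = [
    ("Japanese", "日本語", "euc_jp"),
    ("Russian", "русский", "koi8_r"),
    ("Greek", "ελληνικά", "iso8859_7"),
    ("Thai", "ภาษาไทย", "tis_620"),
]


def size_ratio(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes, utf8_bytes / legacy_bytes)."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return legacy, utf8, utf8 / legacy


for name, text, codec in SAMPLES:
    legacy, utf8, ratio = size_ratio(text, codec)
    print(f"{name:9} {codec:10} {legacy:3} -> {utf8:3} bytes (x{ratio:.1f})")
```

For pure CJK text the ratio is 1.5, which is the "half again as
big" figure quoted above.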


> Being sensitive to potential accusations of Western imperialism, my
> question is whether this file size increase is something that Asian
> computer users have strong feelings about?  Will it be a major
> stumbling block hindering the acceptance of the UTF-8 encoding?  Or is it
> a non-issue?

IMO, this "file size" problem is relatively minor compared
with the many other problems of Unicode and UTF-8: CJK Han
Unification (Japanese people generally think that Unicode
unified not only very similar characters from the CJK countries
but also characters with significant differences); the lack of
a single agreed Unicode <-> JIS X 0208 conversion table (see
http://www.debian.or.jp/~kubota/unicode-symbols.html ; this
will never be solved because of a political problem, the
confrontation among vendors like MS and IBM); "EastAsianWidth"
bugs, which are strongly related to the same political problem
(see the above web page); and so on.
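
As one concrete illustration of the conversion table problem,
here is a minimal sketch, assuming Python 3, whose "shift_jis"
codec follows a JIS-style table and whose "cp932" codec follows
the Microsoft table.  The choice of character (JIS X 0208 row 1,
cell 33, the wave dash) and the helper name codepoint() are
assumptions made for the example.

```python
# The same JIS X 0208 character (row 1, cell 33), written here as the
# Shift_JIS byte pair 0x81 0x60, decoded with two different tables.
WAVE_DASH_SJIS = b"\x81\x60"


def codepoint(raw, codec):
    """Decode a single character and return it as a 'U+XXXX' string."""
    return f"U+{ord(raw.decode(codec)):04X}"


jis_style = codepoint(WAVE_DASH_SJIS, "shift_jis")  # JIS-style mapping
ms_style = codepoint(WAVE_DASH_SJIS, "cp932")       # Microsoft-style mapping
print(jis_style, ms_style)
```

The two tables give U+301C (WAVE DASH) and U+FF5E (FULLWIDTH
TILDE) for the same bytes, so text converted with one table may
not convert back cleanly with the other.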

The size problem will be solved by the increasing size of disks.
(Embedded systems, which have limited memory, are the exception.)

Even so, I admit that UTF-8 is useful and will be one of the
most important encodings in the future.  (This is why I publicize
the problems of Unicode and UTF-8 -- using Unicode without
knowing its problems is foolish.)

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/


