
UTF-8, CJK and file size



A question has come to mind during the recent discussions about
UTF-8 handling.

UTF-8 preserves ASCII text, 1 byte for 1 character.  So a UTF-8 file
containing only ordinary English letters will be the same size as (and
in fact, byte for byte the same file as) the ASCII equivalent.  Latin
accented characters take up 2 bytes each, but since they are relatively
sparse in most text, this won't lead to a huge increase in file size.
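
To make those byte counts concrete, here is a minimal Python sketch.
The sample strings and the helper name utf8_len are my own
illustrations, chosen only to show the 1-byte and 2-byte cases.

```python
# Illustrative sketch: the sample strings and helper name are assumptions.

def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

ascii_text = "plain english text"
accented_text = "caf\u00e9"  # "cafe" with an acute accent on the final e

# ASCII-only text produces the identical byte sequence in both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# One byte per character for ASCII; the accented letter costs one extra byte.
print(same_bytes, utf8_len(ascii_text), len(ascii_text))
print(utf8_len(accented_text), len(accented_text))
```

The second line of output shows one more byte than characters, which
is the single accented letter taking 2 bytes.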

For the Asian CJK characters, however, UTF-8 typically uses 3 bytes per
character.  This is in contrast to the current national encodings,
which use 2 bytes per character.  Files consisting mostly of CJK text
will, therefore, become about half again as large, and consequently
their transmission over the internet will take roughly half again as
long.
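
The 3-versus-2 byte comparison can be checked with a short Python
sketch.  The codec names below are ones Python's standard library
ships; the sample words and the helper name size_ratio are my own
choices for illustration, not a survey of real documents.

```python
# Illustrative sketch: sample words and helper name are assumptions.

def size_ratio(text: str, legacy_codec: str) -> float:
    """Return UTF-8 size divided by the size in a legacy national encoding."""
    legacy_size = len(text.encode(legacy_codec))
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / legacy_size

# Short samples made up purely of CJK characters, keyed by legacy codec.
samples = {
    "shift_jis": "\u65e5\u672c\u8a9e",  # "Japanese language" in Japanese
    "gb2312": "\u4e2d\u6587",           # "Chinese" in simplified Chinese
    "big5": "\u4e2d\u6587",             # same word, traditional-Chinese codec
    "euc_kr": "\ud55c\uad6d\uc5b4",     # "Korean language" in Korean
}

ratios = {codec: size_ratio(text, codec) for codec, text in samples.items()}

for codec, ratio in ratios.items():
    print(codec, ratio)
```

Each ratio comes out at 1.5 for these pure-CJK samples.  Text that
mixes in ASCII (markup, digits, spaces) would show a smaller ratio,
since those characters stay at 1 byte in both encodings.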

Being sensitive to potential accusations of Western imperialism, my
question is whether this file size increase is something that Asian
computer users have strong feelings about.  Will it be a major
stumbling block hindering the acceptance of the UTF-8 encoding?  Or is
it a non-issue?

Drew

-- 
PGP public key available at http://dparsons.webjump.com/drewskey.txt
Fingerprint: A110 EAE1 D7D2 8076 5FE0  EC0A B6CE 7041 6412 4E4A


