
Re: UTF-8, CJK and file size



On Wed, Jul 11, 2001 at 03:50:02PM +0100, Richard Kettlewell wrote:
> Drew Parsons writes:
> 
> > For the Asian CJK characters, however, UTF-8 typically uses 3 bytes
> > per character.  This is in contrast to the current national
> > encodings which use 2 bytes per character.  The files will,
> > therefore, become half again as big in size, and consequently their
> > transmission over the internet will take half again as long.
> 
> It would be interesting and relevant to know what the typical size
> difference between a gzip'd CJK-encoded document and the equivalent
> gzip'd UTF-8-encoded document is.
> 

Good question.  (This also seems a good point to mention that there is
an "official" UTF-8 compression scheme.)
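The gzip comparison is easy enough to try by hand. Below is a minimal
sketch in Python, assuming the built-in euc_jp, shift_jis and utf-8
codecs and an arbitrary sample sentence repeated many times. The
sample and the name encoded_sizes are illustrative only; repeated text
compresses far better than a real document would, so the numbers show
the method rather than answer the question. A real test should read in
a full document instead.

```python
import gzip

# Assumed sample text: one short Japanese passage repeated 200 times.
# This is NOT representative of a real document; substitute a real file.
sample = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。" * 200


def encoded_sizes(text, encoding):
    """Return (raw byte count, gzip'd byte count) for text in an encoding."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw))


if __name__ == "__main__":
    for enc in ("euc_jp", "shift_jis", "utf-8"):
        raw_len, gz_len = encoded_sizes(sample, enc)
        print(f"{enc:10s} raw={raw_len:7d} bytes  gzip={gz_len:6d} bytes")
```

For text made up entirely of kana and kanji, the raw UTF-8 size comes
out at exactly 1.5 times the EUC-JP size (3 bytes against 2); the
interesting figure is how much of that gap survives compression.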

It also raises another interesting question: how does the size of a
standard Asian text compare to a Western one?

A typical novel in English is maybe 300 pages long.  Suppose there are
about 35 lines per page and 50-odd letters per line (judging from the
novel I'm currently reading).  That's about 525,000 characters, or
roughly 500 KB at one byte per character.  How many bytes would a
Japanese novel take up?
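The back-of-envelope figure can be written out as a small Python
sketch. The page, line and letter counts are the guesses from the
paragraph above; the helper name estimate_bytes is made up for
illustration. The character count of a Japanese novel is left as an
input, since that is exactly the unknown being asked about.

```python
# Rough guesses for an English novel, taken from the estimate above.
PAGES = 300
LINES_PER_PAGE = 35
CHARS_PER_LINE = 50


def estimate_bytes(char_count, bytes_per_char):
    """Size in bytes of char_count characters at a fixed width per character."""
    return char_count * bytes_per_char


english_chars = PAGES * LINES_PER_PAGE * CHARS_PER_LINE

if __name__ == "__main__":
    # English text in ASCII or UTF-8: 1 byte per character.
    print(english_chars, estimate_bytes(english_chars, 1))  # 525000 525000

    # For a Japanese novel, plug in a character count once one is known:
    # 2 bytes per character in the national encodings, 3 in UTF-8.
    # estimate_bytes(japanese_chars, 2)
    # estimate_bytes(japanese_chars, 3)
```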

Drew

-- 
PGP public key available at http://dparsons.webjump.com/drewskey.txt
Fingerprint: A110 EAE1 D7D2 8076 5FE0  EC0A B6CE 7041 6412 4E4A


