
Re: UTF-8, CJK and file size



On Thu, Jul 12, 2001 at 11:23:24PM +1000, Drew Parsons wrote:
> ...
> A typical novel in English is maybe 300 pages long. Suppose there's
> about 35 lines per page and 50 odd letters per line (judging from the
> novel I'm currently reading).  That's about 500 KB for one novel.  How
> many bytes would a Japanese novel take up?

That's a trick question, in a way. While an English word
may be anywhere from 1 to 10 letters, with an average length of maybe 5
letters, a Japanese "word" has an average of about 2 "letters".
(Don't forget that a single kanji can be a whole word.)
Except that you then get conjugation, so there may be an extra "letter"
or two thrown in there.

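So, byte-wise, they should be roughly the same.
5 English letters == 5 bytes per word average
2.5 Japanese "letters" == 2.5 Unicode characters average == 5 bytes/word av.
(That assumes 2 bytes per character, as in UTF-16. In UTF-8, kana and kanji
take 3 bytes each, so it's more like 7.5 bytes/word.)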
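
The per-word estimate depends on how many bytes each character takes in a
given encoding. Here is a minimal Python sketch for checking that. The two
sample words are my own illustration, not from the original post, and a real
comparison would need whole texts rather than single words.

```python
# Hypothetical sample words, chosen only for illustration.
english_word = "novel"   # 5 ASCII letters
japanese_word = "小説"    # 2 kanji, roughly the same meaning


def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# "utf-16-le" is used instead of "utf-16" so that no byte order mark
# is added to the count.
for label, word in (("English", english_word), ("Japanese", japanese_word)):
    for encoding in ("utf-8", "utf-16-le"):
        size = encoded_size(word, encoding)
        print(f"{label:8} {encoding:10} {len(word)} chars -> {size} bytes")
```

The same function can be run over a full text file's contents to compare
real novels instead of single words.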


