
Re: UTF-8, CJK and file size



Steve Langasek <vorlon@netexpress.net> writes:
> Yes, the 'file size' problem is not a non-issue in general,

Why is it not a non-issue? Everyone changing to XML probably costs more in
data size than UTF-8's 50% growth (CJK text that takes two bytes per character
in EUC or Shift_JIS takes three in UTF-8). Most of the space-consuming files on
my hard drive are music, pictures, and source code (which is pretty much all
ASCII), not text. The entire output of Project Gutenberg (sans the Human
Genome) fits on one 640 MB CD with a good 150 MB to spare.
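
For a rough sense of where the 50% figure comes from, here's a quick Python
check (the sample sentence below is just something I made up, not taken from
this thread):

    # A rough, made-up illustration of the ~50% growth: Japanese text that
    # costs two bytes per character in EUC-JP or Shift_JIS costs three bytes
    # per character in UTF-8.
    sample = "日本語のテキストは符号化方式によって大きさが変わります。"

    for codec in ("euc_jp", "shift_jis", "utf-8"):
        encoded = sample.encode(codec)
        print(codec, len(encoded) / len(sample), "bytes per character")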

If file size is a problem, then implement SCSU, a Unicode-specific compression
scheme that gets alphabetic text down to about 1 byte per character and CJK
characters to about 2 bytes per character; the sample Japanese text comes out
at just over 1.5 bytes per character, better than any existing encoding. Better
yet, SCSU-compressed text can still be compressed with gzip for a compression
ratio better than either alone. I'll have to pound on Ngeadal some more so we
can get a decent SCSU encoder into Debian.
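
In the meantime, here's a minimal sketch of just SCSU's initial state
(single-byte mode with the default Latin-1 window), which is enough to show the
roughly one byte per character for Western text and that gzip on top still
helps; a real encoder per UTS #6 also handles window switching, the Unicode
mode for CJK, and the SQ/SC/SD tag bytes:

    # A minimal sketch, not a real SCSU encoder: it covers only SCSU's initial
    # state (single-byte mode, default dynamic window = Latin-1).
    import gzip

    def scsu_encode_latin1(text):
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp in (0x00, 0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7F:
                out.append(cp)      # ASCII and common controls pass through
            elif 0x80 <= cp <= 0xFF:
                out.append(cp)      # default window covers U+0080..U+00FF
            else:
                raise ValueError("outside the default window; needs a real encoder")
        return bytes(out)

    sample = "Et la fenêtre s'ouvrit à minuit près de la forêt. " * 20
    scsu = scsu_encode_latin1(sample)
    utf8 = sample.encode("utf-8")

    print(len(utf8) / len(sample))   # above 1.0: accented letters take two bytes in UTF-8
    print(len(scsu) / len(sample))   # 1.0: one byte per character in SCSU single-byte mode
    print(len(gzip.compress(scsu)))  # and gzip on top of SCSU shrinks it further still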

> I hold the optimistic view that these are simply bugs in Unicode that will
> be worked out over time.

But they won't. The unstandardized conversion tables probably won't get
standardized, because each works fine in isolation. The Adobe and Microsoft
people who make up a lot of the Unicode consortium don't use terminal emulators
and aren't going to encode some complex solution for terminal widths. And
asking the Unicode people to change how ideographs are unified is like asking
Debian to change to RPM: a lot of digital ink has been wasted on it, and making
the change would gut one of the fundamental principles of Unicode.
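
(What I take the terminal-width problem to be: Unicode's East Asian Width
property marks some characters as ambiguous, so a terminal emulator can't tell
from the standard alone whether to draw them in one cell or two. A quick Python
illustration, with characters picked purely as examples:)

    # Query Unicode's East Asian Width property: 'Na' is narrow, 'W' is wide,
    # and 'A' is ambiguous -- the case terminal emulators have no answer for.
    import unicodedata

    for ch in ("a", "あ", "漢", "±", "Ω"):
        print(ch, unicodedata.east_asian_width(ch))
    # 'a' -> Na, 'あ' and '漢' -> W, '±' and 'Ω' -> A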

> There is much to be gained by making sure Unicode meets
> the needs of *everyone* in the international community, not just
> westerners.

Please don't phrase it like this. The Chinese, the Arabs, the Africans and
the Indians (both from America and India) all seem to have their needs met
by Unicode. Just because the loudest complainers about the Unicode standard
are the Japanese doesn't mean that Unicode is a West-versus-East thing.
Japan's main problem, IMO, is that Japan has large, complex pre-existing
character coding systems, Unicode is built fundamentally differently from
them, and meshing the two is more complex.

--
David Starner - dstarner98@aasaa.ofe.org


