
Re: Fwd^2: Re: rendering unicode han

>>>>> "Graydon" == Graydon Hoare <gray@interlog.com> writes:

    Graydon> I do not believe that unicode is hopelessly
    Graydon> unrecoverable. It is a mess, true, but text processing in
    Graydon> general is a mess. The alternative to unicode (or 10646,
    Graydon> which imo is the same idea) is much worse, and we should
    Graydon> simply correct mistaken unifications rather than throwing
    Graydon> our hands up and returning to having to make different
    Graydon> versions of each software package for each national
    Graydon> character set standard.

To throw a bone in for consideration:

Jim Blandy is currently extending Guile, the GNU Scheme, to handle
multibyte characters.  The same question came up there, and what he has
opted to do is use the Emacs/MULE encoding for a brief period and then
switch to ISO 10646 in a UTF-8ish encoding; MULE's private encoding
sucks so much that Emacs, too, is moving in this direction.  To
accommodate this change, they're introducing a simple C API that
abstracts away from character set issues.

You do lose the ability to address individual characters in a string
with this--but it has been pointed out that very few string algorithms
do, in fact, need to address individual characters in a string.  Most
seem to want some sort of iteration over the string; thus regexps,
typesetting, etc, are not horribly inconvenienced by the use of a
variable length encoding, and Western languages are not forced to use 32
bit characters when unnecessary.
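
To make the iteration point concrete, here is a minimal sketch in Python
(not the Guile or Emacs API, which I have not seen; the function names
are made up for illustration).  It walks a UTF-8 byte string one code
point at a time.  It assumes the one-to-four-byte forms only and does not
reject overlong encodings, so treat it as a sketch rather than a
validating decoder.

```python
def iter_code_points(data: bytes):
    """Yield (byte_offset, code_point) pairs from UTF-8 encoded bytes.

    Hypothetical helper for illustration; handles 1- to 4-byte sequences
    and does not check for overlong forms.
    """
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # The lead byte alone tells us how long the sequence is.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte contributes its low six bits.
        for b in data[i + 1:i + length]:
            if b >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield i, cp
        i += length


def nth_code_point(data: bytes, index: int) -> int:
    """Random access by character index costs a scan from the start."""
    for k, (_, cp) in enumerate(iter_code_points(data)):
        if k == index:
            return cp
    raise IndexError(index)
```

A regexp matcher or a typesetter only ever needs the first function, a
forward walk.  Only code that truly wants "the Nth character" pays the
linear cost shown in the second.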

FWIW: I know nothing about encoding Chinese, but all the information
I've seen about Unicode coming from CJKV countries has stated that the
glyph unification is simply not acceptable for them.  Rather than
second-guess these people, I'm more inclined to believe that it really
is a problem.  So it would seem to me that 10646 would be the way to go;
the extra pain of having a 32 bit character is mostly alleviated through
the UTF-8 encoding, and 2 billion individual code points should be, and
appear to be, more than enough for everyone.
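
As a rough illustration of why the 32 bit pain mostly goes away, the
sketch below (Python again, with a made-up function name) computes how
many bytes a code point takes under the original UTF-8 scheme for the
31-bit 10646 space, which allows up to six bytes.  Later revisions of
the standards restrict UTF-8 to four bytes; that does not change the
answer for the small values shown here.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed for a code point under the original 31-bit UTF-8 scheme.

    Hypothetical helper for illustration only.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Exclusive upper bounds for the 1- through 6-byte forms.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point outside the 31-bit range")
```

ASCII text stays at one byte per character, accented Latin letters take
two, and a typical Han character takes three, so nobody pays for four
bytes per character unless their text actually needs it.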

Graham Hughes <graham@ccs.ucsb.edu>
GPG Fingerprint: 4FC5 80F0 63EB 00BE F438  E365 084B 4010 60BF 17D3
((lambda (x) (list x (list 'quote x)))
 '(lambda (x) (list x (list 'quote x))))
