Re: Locale-related questions



On tridi 13 Frimaire, year CCXXIV, Martin Strömberg wrote:
> I understand the reference to (the mess from) MS (although at that
> time everyone in America did think 64k would be enough characters for
> everyone, didn't they?), but not the one to Sun.
> 
> Were they the ones that made sure that C's wchar_t is standardized as
> it is? Care to enlighten me?

From memory, as this was more than thirteen years ago. They may have fixed
their design, but I doubt it given Solaris' religion on backwards
compatibility.

With Solaris' implementation of wchar_t (i.e. in the libc, but of course,
outside the Linux world, the libc and the kernel come from the same origin
and have the same name), the value for non-ASCII characters depends on the
locale.

More precisely, the value of wchar_t with UTF-8 locales is the Unicode code
point (same as GNU and any sane implementation), but with non-UTF-8 locales,
it is something else, usually derived from the octet value with markers in
the high order bits.

For example, for the character "U+00C9 LATIN CAPITAL LETTER E WITH ACUTE",
with LC_CTYPE=en_US.UTF-8, the wchar_t value is 0x000000c9, but with
LC_CTYPE=en_US (i.e. ISO-8859-1), the wchar_t value is something like
0x010000c9.

I do not remember if I tested the case of non-UTF-8 non-ISO-8859-1 locales.
My guess is, for "U+0399 GREEK CAPITAL LETTER IOTA", it is 0x00000399 for
LC_CTYPE=en_US.UTF-8, but for el_GR.ISO-8859-7, it would be something like
0x070000c9 (the ISO-8859-7 code for U+0399 is 0xc9).
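
To observe this on a given system, a small helper along the following lines
can decode one character under a chosen locale and report the resulting
wchar_t value. This is only a sketch: the name first_wchar_in_locale is made
up for illustration, locale names differ between systems, and the call
changes the process-wide LC_CTYPE as a side effect.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Decode the first multibyte character of `bytes` with LC_CTYPE set to
   `locale_name` (hypothetical helper, not part of any library).
   Returns the wchar_t value, or -1 if the locale is not available or the
   bytes are not a complete, valid character in that locale. */
long first_wchar_in_locale(const char *locale_name, const char *bytes)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                      /* locale not installed here */

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* start in the initial shift state */

    wchar_t wc = 0;
    size_t n = mbrtowc(&wc, bytes, strlen(bytes), &state);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                      /* invalid or incomplete sequence */

    return (long)wc;
}
```

For instance, first_wchar_in_locale("en_US.UTF-8", "\xc3\x89") decodes the
UTF-8 encoding of U+00C9, and the same call with an ISO-8859-1 locale and
"\xc9" decodes the single-octet form. Where wchar_t is locale-independent
the two results agree; on an implementation like the one described above
they would not.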

In practice, that means that you cannot use wchar_t values to query Unicode
databases, and that if the locale changes while your program is running, all
your wchar_t values become invalid.
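
C99 offers a compile-time hint for this: an implementation defines the macro
__STDC_ISO_10646__ when values of type wchar_t are ISO/IEC 10646 (Unicode)
code points. A minimal sketch of guarding on it follows; the helper name
wchar_to_codepoint and the -1 convention are assumptions for illustration.

```c
#include <wchar.h>

/* Return the Unicode code point for `wc`, or -1 when the implementation
   does not promise that wchar_t holds ISO/IEC 10646 values
   (hypothetical helper). */
long wchar_to_codepoint(wchar_t wc)
{
#ifdef __STDC_ISO_10646__
    return (long)wc;    /* wchar_t values are Unicode code points */
#else
    (void)wc;
    return -1;          /* no guarantee: do not query Unicode tables */
#endif
}
```

Code that needs code points on a platform where the macro is absent has to
convert from the locale's encoding by some other means instead of trusting
the wchar_t value.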

(Sun is also responsible for making Java's char type 16 bits wide and its
strings UTF-16. This is another case of headdesk, or possibly headwall, although
entirely unrelated to the wchar_t issue. For those who do not know,
basically UTF-16 manages to combine almost all the drawbacks of UTF-8 with
almost all the drawbacks of UCS-4, plus a whole bunch of drawbacks of its
own.)
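
As one illustration of the variable-width drawback: a code point above
U+FFFF does not fit in a single 16-bit unit and has to be split into a
surrogate pair, so a 16-bit char is not always a whole character. The sketch
below shows the standard arithmetic; the function name utf16_encode is
hypothetical.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if `cp` is
   not a valid scalar value (a surrogate, or above U+10FFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in one unit */
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this, U+00C9 takes one unit while U+1F600 takes two (0xD83D, 0xDE00).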

Regards,

-- 
  Nicolas George
