
Re: UTF-8 locales



Hi,

At Thu, 16 Nov 2000 09:40:26 +0000,
Edmund GRIMLEY EVANS <edmundo@rano.org> wrote:

> >  You are right... the i18n in Linux is not coming well, everybody seems to
> > implement their own scheme...
> >  Besides, GNU having chosen a sizeof(wchar_t)==4 doesn't help to encourage
> > using libc's locale support... =/

Memory consumption matters less than whether I can use my everyday
encodings (EUC-JP, ISO-2022-JP, and so on) at all.

I had not considered developers who hesitate to use wchar_t because of
its memory consumption.  I find that hard to believe, since memory
consumption is a trivial problem compared with whether a user can use
the software at all.

I would agree with developers who hard-code UTF-8 instead of using
wchar_t, if they also dropped support for 8-bit (or 7-bit) encodings
from the software they develop.  But if they (speakers of European
languages, in most cases) still need their own everyday 7-bit and 8-bit
encodings, and therefore keep supporting them, why do they choose not
to support our everyday encodings?


> If you are suggesting that sizeof(wchar_t) could be 2, then please
> explain what you think mbtowc(&wc, "\360\220\200\200", 4) should do in
> a UTF-8 locale, and why you think that would be easier for

We cannot assume anything about the concrete values held in wchar_t
variables.  If a system uses UCS-2 as the internal representation of
wchar_t, that call to mbtowc() will fail.  However, there could be a
system where sizeof(wchar_t) is 2 and the internal representation of
wchar_t is not UCS-2, on which such an mbtowc() call does not fail.

# OK, such a system is not likely to exist.  What I wanted to say is
# that UCS is not the only candidate for the internal representation
# of wchar_t.  For example, a system could plausibly use a Mule-like
# code for wchar_t, i.e., some bits specify a coded character set and
# the remaining bits give the code point within that set.
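
To make the "Mule-like code" idea concrete, here is a minimal sketch of
a 16-bit wide character that is not UCS-2.  Everything in it is an
assumption made for illustration: the type name tagged_wc, the helper
functions, the charset tags, and the 2 + 7 + 7 bit layout are
hypothetical and do not describe any particular libc.

```c
#include <stdint.h>

/* Hypothetical 16-bit wide character: 2 tag bits select a coded
 * character set, 14 bits hold a code point inside that set
 * (two 7-bit bytes, enough for a 94x94 set). */
typedef uint16_t tagged_wc;

/* Hypothetical tag values, one per coded character set. */
enum charset_tag {
    CS_ASCII         = 0,
    CS_JISX0208      = 1,
    CS_JISX0201_KANA = 2,
    CS_JISX0212      = 3
};

/* Pack a charset tag and the two 7-bit bytes of a code point. */
tagged_wc tagged_make(enum charset_tag cs, unsigned row, unsigned col)
{
    return (tagged_wc)(((unsigned)cs << 14)
                       | ((row & 0x7Fu) << 7)
                       | (col & 0x7Fu));
}

/* Recover the charset tag from the top two bits. */
enum charset_tag tagged_charset(tagged_wc wc)
{
    return (enum charset_tag)(wc >> 14);
}

/* Recover the first (row) byte of the code point. */
unsigned tagged_row(tagged_wc wc)
{
    return (wc >> 7) & 0x7Fu;
}

/* Recover the second (column) byte of the code point. */
unsigned tagged_col(tagged_wc wc)
{
    return wc & 0x7Fu;
}
```

With this layout, the JIS X 0208 character 0x2422 would be stored as
tagged_make(CS_JISX0208, 0x24, 0x22), a value that has nothing to do
with its UCS code point.  A program that treats wchar_t values as UCS
would misread it, which is why the concrete value must stay opaque.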

FYI: "\360\220\200\200" in UTF-8 encodes U+10000, the first code point
that does not fit in 16 bits.
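
For anyone who wants to check that value by hand, the following is a
minimal sketch of a UTF-8 decoder that does not depend on the current
locale or on mbtowc().  The names utf8_decode and fits_in_16_bits are
hypothetical helpers, not libc functions, and the upper limit of
U+10FFFF is an assumption taken from the current UTF-8 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most n bytes) into *out.
 * Returns the number of bytes consumed, or -1 if the input is
 * truncated or not well-formed.  Hypothetical helper. */
int utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point allowed for each sequence length. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    int len, i;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return -1;

    /* The lead byte gives the sequence length and the top bits. */
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else
        return -1;
    if ((size_t)len > n)
        return -1;

    /* Each continuation byte is 10xxxxxx and adds six bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min_cp[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    *out = cp;
    return len;
}

/* True if the code point can be held in a 16-bit UCS-2 unit. */
int fits_in_16_bits(uint32_t cp)
{
    return cp <= 0xFFFF;
}
```

Calling utf8_decode() on the four bytes "\360\220\200\200" consumes all
four and yields 0x10000, for which fits_in_16_bits() is false.  That is
the case where a wchar_t holding UCS-2 has no value to return.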

---
Tomohiro KUBOTA <kubota@debian.org>
http://surfchem0.riken.go.jp/~kubota/
