
Re: UTF-8 locales



Hi,

At Mon, 20 Nov 2000 11:28:25 -0600,
David Starner <dvdeug@x8b4e516e.dhcp.okstate.edu> wrote:

> As for the reason I don't use wchar_t, not all the world works in C.
> Most other languages have roll-your-own support for multi-byte character
> sets or provide Unicode support.

It is true that languages other than C/C++ generally don't support
wchar_t.  So people who use multibyte encodings may be annoyed by software
written in those languages.  I am interested in finding some tricks to
overcome this problem, especially for Perl.

However, I want developers to rely on the locale rather than hard-coding
UTF-8, at least when using C/C++.

> Unfortunately, there are multiple CJK encodings, and there are multiple CJK
> character sets for each encoding. As this is not a scratch I need to itch,
> I'm not going to mess with increasing complexity to support it.

Using wchar_t, you don't need to be aware of these encodings or character
sets.  It is no more complex than hard-coding UTF-8.
Please change your mind, and all CJK people will be glad.

For example,

int a = getchar();       ---->   wint_t a = getwchar();
char str[] = "string";   ---->   wchar_t str[] = L"string";
strlen(), strchr(), ...  ---->   wcslen(), wcschr(), ...

Do you think this is more complex than hard-coding UTF-8?
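
As a minimal sketch of how the wide-character functions in the table
line up with their byte-oriented counterparts, here is a small helper
built on wcschr().  The name count_wchar() is made up for illustration
and does not come from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Count how many times the wide character c occurs in the wide string s.
 * count_wchar() is a hypothetical helper for illustration only. */
size_t count_wchar(const wchar_t *s, wchar_t c)
{
    size_t n = 0;

    assert(s != NULL);
    if (c == L'\0')
        return 0;   /* the terminator is not counted as an occurrence */

    /* wcschr() is the wide counterpart of strchr(). */
    for (const wchar_t *p = wcschr(s, c); p != NULL; p = wcschr(p + 1, c))
        n++;
    return n;
}
```

The loop has the same shape as the one you would write with char and
strchr(); only the types and the function names differ.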

# A more detailed list and simple examples of rewriting software to use
# wchar_t can be found in my document on i18n
# http://www.debian.org/doc/manuals/intro-i18n/
# though it is undergoing a major rewrite now.


> If we're talking on the encoding level, there's only one encoding - a
> sequence of byte-sized characters. Most programs have no need to know
> the difference between Latin-1 and Latin-3 and KOI8-R, and it's the
> most trivial encoding to use. 

Agreed.  Programs such as 'echo' and 'cat' don't need modification.
However, consider 'ls'.  It formats its output using the visible width of
file names.  The number of columns on a tty may differ from the number of
characters and, as you know, the number of characters differs from the
number of bytes even in UTF-8.

To convert a number of characters into a number of columns, you
will need to use wcwidth() or wcswidth().  Note that you will have to
think about this _even if_ you hard-code UTF-8 instead of using wchar_t.
Thus, using wchar_t is no more complex than hard-coding UTF-8.
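
A rough sketch of the column calculation that a program like 'ls' needs,
assuming a POSIX system where <wchar.h> declares wcswidth() once
_XOPEN_SOURCE is defined.  The helpers display_columns() and
padding_for() are invented names, not part of any library.

```c
#define _XOPEN_SOURCE 700   /* asks <wchar.h> to declare wcswidth() */
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Return the number of terminal columns needed to print s, or -1 if s
 * contains a wide character that is not printable in the current locale.
 * Hypothetical helper for illustration only. */
int display_columns(const wchar_t *s)
{
    assert(s != NULL);
    return wcswidth(s, wcslen(s));
}

/* Return how many spaces are needed to pad s out to field_width columns.
 * Returns 0 when s is already too wide or its width is unknown. */
int padding_for(const wchar_t *s, int field_width)
{
    int cols = display_columns(s);

    if (cols < 0 || cols >= field_width)
        return 0;
    return field_width - cols;
}
```

The caller is expected to have called setlocale(LC_ALL, "") first, so
that the widths reported match the user's locale.  In a CJK locale a
single character may be reported as two columns wide, which is why
counting characters alone is not enough for aligning output.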

To convert a number of bytes into a number of characters, using
wchar_t reduces complexity.  One wchar_t means exactly one
character, even in multibyte locales such as EUC-JP and UTF-8.
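
A small sketch of the byte-to-character conversion, assuming a POSIX
system where mbstowcs() accepts a null destination and only counts.
count_characters() is a made-up name for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the number of characters in the multibyte string s, decoded
 * according to the current LC_CTYPE locale, or (size_t)-1 if s contains
 * an invalid multibyte sequence.  Hypothetical helper for illustration. */
size_t count_characters(const char *s)
{
    assert(s != NULL);
    /* With a null destination, POSIX mbstowcs() converts nothing and
     * returns the number of wide characters the string would need. */
    return mbstowcs(NULL, s, 0);
}
```

After setlocale(LC_ALL, ""), the same function works unchanged in
EUC-JP, UTF-8 or any other locale the system supports.  For strings
containing multibyte characters the result will be smaller than
strlen(s), which counts bytes.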

---
Tomohiro KUBOTA <kubota@debian.org>
http://surfchem0.riken.go.jp/~kubota/


