
Re: I/O for different encodings



At Fri, 10 Nov 2000 00:01:10 +0200,
Shaul Karl <shaulka@bezeqint.net> wrote:
> > I'm working on a piece of software that will parse textual data (a
> > list of words), conduct some statistical analyses, and spit out more
> > textual data.  I'd like to support multiple languages, maybe even
> > multibyte encodings.  Can someone please point me towards some
> > resources, in particular how to handle text input and output in a
> > language-independent way?  As you can probably guess, I'm new to i18n.
> 
> Not sure, but I believe that everything is in the process of converging to
> Unicode (UTF-8). Therefore, if I were writing such a program, I would make
> it use this encoding.
> As for resources, there is a Unicode HOWTO on the LDP and many other resources
> on the net.
> Hope this helps.

If you want to analyse multiple languages including Far East CJKVT characters,
you should make your program deal with multiple encodings.
UTF-8 has no ability to distinguish Chinese Hanzi / Japanese Kanji / Korean
Hanja / ... (CJKVT), because Han unification merged these thousands of
characters into one area of the Unicode BMP. So, if you think distinguishing
these Far East characters is important, Unicode/UTF-8 is inappropriate.

In addition, be careful about translating to UCS-4 (which glibc uses as wchar_t).
It also uses the Unicode BMP, so analysis of these CJKVT characters is just as
difficult. The better solution is to modify your program so that it can handle
text input and output in each encoding independently.
glibc 2.2's localedata/charmaps or iconvdata may be helpful to you.

If you do not intend to distinguish CJKVT characters, ignore my mail
(and UTF-8/UCS-2/UCS-4 is an easy and nice solution).

Regards,
-- GOTO Masanori


