
Re: I/O for different encodings



At Fri, 10 Nov 2000 00:01:10 +0200,
Shaul Karl <shaulka@bezeqint.net> wrote:
> > I'm working on a piece of software that will parse textual data (a
> > list of words), conduct some statistical analyses, and spit out more
> > textual data.  I'd like to support multiple languages, maybe even
> > multibyte encodings.  Can someone please point me towards some
> > resources, in particular how to handle text input and output in a
> > language-independent way?  As you can probably guess, I'm new to i18n.
> 
> Not sure, but I believe that everything is in the process of converging to
> Unicode (UTF-8). Therefore, if I were writing such a program, I would make
> it use this encoding.
> As for resources, there is a Unicode HOWTO on the LDP and many other resources
> on the net.
> Hope this helps.

If you want to analyse multiple languages including Far East CJKVT characters,
you should make your program deal with multiple encodings.
UTF-8 has no ability to distinguish Chinese Hanzi / Japanese Kanji / Korean
Hanja / ... (CJKVT), because Han unification merged these thousands of
characters into one area of the Unicode BMP. So, if you think distinguishing
these Far East characters is important, Unicode/UTF-8 is inappropriate.

In addition, be careful about translating to UCS-4 (which glibc uses as wchar_t).
It also uses the Unicode BMP, so analysis of these CJKVT characters is just as
difficult. The better solution is to modify your program so that it can handle
text input and output in each encoding independently.
glibc 2.2's localedata/charmaps or iconvdata may be helpful to you.

If you do not intend to distinguish CJKVT characters, ignore my mail
(and UTF-8/UCS-2/UCS-4 is an easy and nice solution).

Regards,
-- GOTO Masanori


