
Re: I/O for different encodings


At 09 Nov 2000 11:30:26 -0500,
Itai Zukerman <zukerman@math-hat.com> wrote:

> I'm working on a piece of software that will parse textual data (a
> list of words), conduct some statistical analyses, and spit out more
> textual data.  I'd like to support multiple languages, maybe even
> multibyte encodings.  Can someone please point me towards some
> resources, in particular how to handle text input and output in a
> language-independent way?  As you can probably guess, I'm new to i18n.

I recommend using the wchar_t-related functions.  This is the
standard way to handle multiple encodings.

Using wchar_t, the software can be locale-sensitive.  If a user sets
his/her environment to UTF-8 (for example, export LC_CTYPE=en_US.UTF-8),
the software can read and write UTF-8.  This is important: all that a
user has to do is set the LANG variable, and every program then works
in the desired locale.

Your software will need to call setlocale(LC_ALL, ""); first.
LC_ALL may be narrowed to LC_CTYPE.  Input is done using fgetc() +
mbstowcs() or fgetwc().  Output is done using wcstombs() + fputc()
or fputwc().  The principle is: I/O is done using 'multibyte
characters' and internal processing is done using 'wide characters'.
Note that a multibyte character does not always occupy multiple
bytes; 'multibyte character' is simply the C standard's term for
externally encoded text.

Note that your software will also support UTF-8 with this wchar_t
style of programming, provided the OS supports a UTF-8 locale.
Implementing UTF-8 directly would mean that:
 - the software supports only UTF-8 (and Latin-1?), and
 - the user cannot switch locales in the usual way.

Since these are standard C functions, you will be able to study
wchar_t programming in many books.

Please also read the following: a Debian Documentation Project
document on i18n, which I wrote.  (I have to update it for
glibc 2.1.9x...)

Tomohiro KUBOTA <kubota@debian.org>
