
Re: I/O for different encodings


At 09 Nov 2000 11:30:26 -0500,
Itai Zukerman <zukerman@math-hat.com> wrote:

> I'm working on a piece of software that will parse textual data (a
> list of words), conduct some statistical analyses, and spit out more
> textual data.  I'd like to support multiple languages, maybe even
> multibyte encodings.  Can someone please point me towards some
> resources, in particular how to handle text input and output in a
> language-independent way?  As you can probably guess, I'm new to i18n.

I recommend using the wchar_t-related functions.  This is the
standard way to handle multiple encodings.

Using wchar_t, the software can be locale-sensitive.  If a user sets
his/her environment to UTF-8 (for example, export LC_CTYPE=en_US.UTF-8),
the software can read and write UTF-8.  This is important: all that a
user has to do is set the LANG variable, and every program then works
in the desired locale.

Your software will need to call setlocale(LC_ALL, ""); first.
LC_ALL may be narrowed to LC_CTYPE.  Input is done using fgetc() +
mbstowcs() or fgetwc().  Output is done using wcstombs() + fputc()
or fputwc().  The principle is: I/O is done using 'multibyte
characters' and internal processing is done using 'wide characters'.
Note that a multibyte character does not always occupy multiple
bytes; 'multibyte character' is simply the C standard's term for
externally encoded text.

Note that your software will also support UTF-8 with this wchar_t
style of programming, provided the OS supports a UTF-8 locale.
Implementing UTF-8 directly would mean that:
 - the software supports only UTF-8 (and Latin-1?), and
 - the user cannot switch locales in the usual way.

Since these are standard C functions, you will be able to study
wchar_t programming in many books.

Please also read the following: a Debian Documentation Project
document on i18n, which I wrote.  (I have to update it for
glibc 2.1.9x...)

Tomohiro KUBOTA <kubota@debian.org>
