Re: I/O for different encodings
Hi,
At 09 Nov 2000 11:30:26 -0500,
Itai Zukerman <zukerman@math-hat.com> wrote:
> I'm working on a piece of software that will parse textual data (a
> list of words), conduct some statistical analyses, and spit out more
> textual data. I'd like to support multiple languages, maybe even
> multibyte encodings. Can someone please point me towards some
> resources, in particular how to handle text input and output in a
> language-independent way? As you can probably guess, I'm new to i18n.
I recommend using the wchar_t-related functions. I think this is the
standard way to handle multiple encodings.
Using wchar_t, the software can be locale-sensitive. If a user sets
his/her environment to UTF-8 (for example, export LC_CTYPE=en_US.UTF-8),
the software can input/output UTF-8. This is important: all that a
user has to do is set the LANG variable, and all software works within
the desired locale.
Your software will need to call setlocale(LC_ALL,""); first.
LC_CTYPE may be used in place of LC_ALL. Input is done using fgetc() +
mbstowcs() or fgetwc(). Output is done using wcstombs() + fputc() or
fputwc().
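As a rough sketch of the fgetwc()/fputwc() route (the helper names
init_locale and copy_wide are mine, made up for illustration, and
error handling is kept minimal):

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale from the user's environment
   (LANG, LC_ALL, LC_CTYPE). Returns 1 on success, 0 if the requested
   locale is unavailable (the program then stays in the "C" locale). */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: copy text from `in` to `out` one wide character
   at a time. The C library decodes the locale's multibyte encoding on
   input and encodes it again on output. Returns the number of
   characters copied, or -1 on a read, write, or conversion error. */
long copy_wide(FILE *in, FILE *out)
{
    long count = 0;
    wint_t wc;

    assert(in != NULL && out != NULL);
    while ((wc = fgetwc(in)) != WEOF) {
        /* Internal processing of the wide character would go here. */
        if (fputwc((wchar_t)wc, out) == WEOF)
            return -1;
        count++;
    }
    /* WEOF means either end of file or an error such as an invalid
       byte sequence; ferror() tells the two apart. */
    return ferror(in) ? -1 : count;
}
```

A main() would call init_locale() once at startup and then, for
example, copy_wide(stdin, stdout). Note that a stream should not mix
byte functions such as fgetc() with wide functions such as fgetwc().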
The principle is: I/O is done using 'multibyte characters' and
internal processing is done using 'wide characters'. Note that a
multibyte character is not always multiple bytes long (in an ASCII
locale, for example, every character is one byte). It is a term from
the C standard.
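The mbstowcs()/wcstombs() route can be sketched in the same spirit.
The function name upcase_multibyte is hypothetical, and upper-casing
merely stands in for whatever internal processing the software does:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return a newly allocated upper-cased copy of
   the multibyte string `mb`, interpreted in the current locale, or
   NULL on failure. The caller frees the result. */
char *upcase_multibyte(const char *mb)
{
    assert(mb != NULL);

    /* A string never has more characters than bytes, so strlen(mb)
       wide characters plus a terminator are always enough. */
    size_t bytes = strlen(mb);
    wchar_t *wide = malloc((bytes + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* External (multibyte) -> internal (wide). */
    size_t n = mbstowcs(wide, mb, bytes + 1);
    if (n == (size_t)-1) {          /* invalid byte sequence */
        free(wide);
        return NULL;
    }

    /* Internal processing works on whole characters, not bytes. */
    for (size_t i = 0; i < n; i++)
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    /* Internal (wide) -> external (multibyte). Each character takes
       at most MB_CUR_MAX bytes in the current locale. */
    size_t cap = n * MB_CUR_MAX + 1;
    char *out = malloc(cap);
    if (out != NULL && wcstombs(out, wide, cap) == (size_t)-1) {
        free(out);
        out = NULL;
    }
    free(wide);
    return out;
}
```

This assumes setlocale() has already been called at startup;
otherwise the conversions use the default "C" locale.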
Note that your software will also support UTF-8 with wchar_t-style
programming, if the OS supports UTF-8 locales. Implementing UTF-8
directly would mean that:
- the software supports only UTF-8 (and latin-1?).
- the user cannot use the common way to switch locales.
Since these functions are standard C functions, you will be able to
study wchar_t from various books.
Please read the following:
http://www.debian.org/doc/manuals/intro-i18n/index.html
This is a Debian Documentation Project document on i18n, which I
wrote. (I have to update it for glibc 2.1.9x...)
---
Tomohiro KUBOTA <kubota@debian.org>
http://surfchem0.riken.go.jp/~kubota/