
Re: non-ASCII characters in /etc/locales.alias ?



Hi,

At Sun, 27 Jan 2002 22:55:59 -0500,
Glenn Maynard wrote:

> On Mon, Jan 28, 2002 at 11:17:09AM +0900, Tomohiro KUBOTA wrote:
> > > That was the original point: to not display locale aliases that can't be
> > > displayed in the current locale.  This simply can't be done reliably.
> > 
> > This is possible if we limit the environment is GNU libc.
> 
> Sorry, I'm not sure what you mean.
> 
> > The current locale can be obtained by nl_langinfo(CODESET).
> > And, the "locale" utility is shipped with GNU libc distribution,
> > it is possible for GNU "locale" utility to use nl_langinfo()
> > without sacrifying portability.
> 
> Of course we know the current encoding, but we don't know the encoding
> of the text in /etc/locale.aliases.  (Unless we assume it's ISO-8859-1.)

I meant that /etc/locale.alias is merely a byte stream with no
declared encoding.  If the bytes 0xe5 and 0xe7 are assigned to
characters in the encoding of the current locale, these locale
names can be regarded as valid.

To test this, there are two ways:
(1) use nl_langinfo(CODESET) and consult a hard-coded table to
    determine whether 0xe5 and 0xe7 are valid characters.
(2) use nl_langinfo(CODESET) to get the encoding of the current
    locale, and then use iconv() to convert from that encoding
    to another one (such as UTF-8).  If the conversion succeeds,
    0xe5 and 0xe7 are valid characters in the locale.
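
A minimal sketch of approach (2), written in Python rather than C.
Python's strict byte decoding stands in for iconv() here, which is a
substitution and not what the "locale" utility itself would do.  The
helper names bytes_valid_in() and alias_displayable() are hypothetical,
and the two sample aliases are only illustrations of names containing
0xe5 and 0xe7.

```python
import locale


def bytes_valid_in(encoding, raw):
    """Return True if every byte sequence in raw decodes in encoding."""
    try:
        raw.decode(encoding)  # strict decoding, standing in for iconv()
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes, or a codeset name Python does not know.
        return False
    return True


def alias_displayable(alias):
    """Hypothetical helper: check an alias against the current locale."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the user's locale
    except locale.Error:
        pass  # unsupported locale settings: stay in the "C" locale
    codeset = locale.nl_langinfo(locale.CODESET)
    return bytes_valid_in(codeset, alias)


if __name__ == "__main__":
    # Sample aliases containing 0xe5 and 0xe7 (assumed for illustration).
    for alias in (b"bokm\xe5l", b"fran\xe7ais"):
        print(alias, alias_displayable(alias))
```

The result of alias_displayable() depends on the locale the script is
run in, so no fixed output is shown.  bytes_valid_in() can also be
called with an explicit encoding name to try a particular codeset.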

Since GNU utilities must be portable (they must also run on non-Linux
systems), nl_langinfo() should be avoided in general GNU utilities
such as fileutils.  However, the "locale" command is shipped as part
of the GNU libc distribution and is designed to be used with GNU
libc.  So the "locale" command can use these functions.

Note that "valid" does not mean that 0xe5 and 0xe7 are the same
characters as in ISO-8859-1.  For example, if 0xe5 and 0xe7 map to
Cyrillic characters, they are still valid.
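
To illustrate the point, the following small Python sketch prints the
Unicode name of the character that 0xe5 and 0xe7 map to in a few
single-byte encodings.  The choice of ISO-8859-5 and KOI8-R as the
Cyrillic examples is an assumption made for this illustration, and
describe() is a hypothetical helper.

```python
import unicodedata


def describe(byte, encoding):
    """Return the Unicode name of the character a single byte maps to."""
    return unicodedata.name(byte.decode(encoding))


if __name__ == "__main__":
    for encoding in ("iso-8859-1", "iso-8859-5", "koi8-r"):
        for byte in (b"\xe5", b"\xe7"):
            print(encoding, byte.hex(), describe(byte, encoding))
```

In all three encodings the bytes decode without error, but only in
ISO-8859-1 do they give the Latin letters the alias names intend.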

Reading the current discussion, I am afraid that you (or someone
else) may think that the above test will report "valid" in every
locale other than the "C" locale.  This is not true.  0xe5 and 0xe7
are not valid characters in many encodings, such as UTF-8 and EUC-JP.



> > However, I prefer that "locale -a" doesn't display 8bit
> > locale names at all, than "locale -a" displays 8bit locale
> > names in ISO-8859-1 locales (or locales where 0xe5 and
> > 0xe7 are valid codepoints).
> 
> Right.

:)  This means what I wrote above is unnecessary.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/


