
Re: non-ASCII characters in /etc/locales.alias ?



On Sat, Jan 19, 2002 at 03:25:58AM +0100, Tollef Fog Heen wrote:
> | I don't understand what you mean by this.  You mean, what is "wrong"?
> 
> LANG can be unset or set to POSIX, still you would be able to input
> for instance Norwegian characters without problem.

There is no byte or series of bytes in the POSIX locale that represents
the character "私", period.  It is impossible to enter Japanese text
when LANG=C or LANG=POSIX.

> | And, the output of "locale" command means what in this context?
> | (Of course I understand what "locale" command outputs.)
> 
> It shows that I am able to input Norwegian (and French) characters
> without configuring a locale.

LANG=POSIX corresponds to 7-bit ASCII.  7-bit ASCII does not contain any
of those characters, so you can't enter them.  At all.  All you did was
enter meaningless 8-bit bytes that your terminal happens to
display as the characters you want.  No program will know what they are.

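#include <locale.h>
#include <stdio.h>
#include <ctype.h>

int main(void)
{
	char c;

	/* Use whatever locale the environment (LANG, LC_*) selects. */
	setlocale(LC_ALL, "");

	if (scanf("%c", &c) != 1)
		return 1;

	/* isalpha() needs an unsigned char value; a plain char may be
	   negative for bytes >= 0x80. */
	printf("%c is %s\n",
		c, isalpha((unsigned char)c)?"alphabetic":"not alphabetic");
	return 0;
}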

03:10pm glenn@zewt.pts/5 [~] export LANG=POSIX
03:11pm glenn@zewt.pts/5 [~] ./a.out
û
û is not alphabetic
03:11pm glenn@zewt.pts/5 [~] export LANG=en_US.ISO-8859-1
03:11pm glenn@zewt.pts/5 [~] ./a.out
û
û is alphabetic
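
To check which character set a locale name actually selects, one can ask
the C library directly.  A minimal sketch, assuming a POSIX system that
provides nl_langinfo(); the helper name codeset_of is made up for this
example, and the exact string returned varies between C libraries (glibc
typically reports ANSI_X3.4-1968, a registered name for ASCII, for the
POSIX locale):

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switch LC_CTYPE to the named locale and return
 * the name of its character set, or NULL if the locale is unavailable.
 * The returned string is owned by the C library; copy it if it must
 * survive a later setlocale() or nl_langinfo() call. */
const char *codeset_of(const char *locale_name)
{
	if (setlocale(LC_CTYPE, locale_name) == NULL)
		return NULL;
	return nl_langinfo(CODESET);
}
```

Printing codeset_of("POSIX") and codeset_of("en_US.ISO-8859-1") should
show an ASCII name for the first and an ISO-8859-1 name for the second,
provided the second locale is installed.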

Another:

03:11pm glenn@zewt.pts/5 [~] export LANG=POSIX
03:12pm glenn@zewt.pts/5 [~] iconv
û
iconv: illegal input sequence at position 0

If I send myself a mail containing "û" under LANG=POSIX, mutt says:

- I     1 /tmp/mutt-zewt-15291-0         [text/plain, 8bit, unknown-8bit, 0.1K]

("What are these 8-bit characters doing in this 7-bit textfile?")
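
The idea behind that label can be sketched in a few lines: any byte with
the high bit set is outside 7-bit ASCII.  This is only an illustration,
not mutt's actual code, and count_8bit is a made-up name:

```c
#include <stddef.h>

/* Hypothetical helper: count the bytes in buf that have the high bit
 * set, i.e. bytes that cannot be 7-bit ASCII. */
size_t count_8bit(const unsigned char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (buf[i] & 0x80)
			n++;
	return n;
}
```

For the single ISO-8859-1 byte 0xFB ("û") followed by a newline this
returns 1; for plain ASCII text it returns 0.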

Entering ISO-8859-1 when not in a locale using ISO-8859-1 is wrong.

> | >  What you need to do is configure your keymap properly.
> | 
> | This is wrong, because keymap is not enough for Japanese input.
> | Well, you cannot configure keymap to input Japanese.
> 
> You can for most other languages.

Most languages don't have thousands of possible kanji to input.  (Others
do have odd things, like ligatures, which I don't know anything about.)

> | BTW, the contents of your mail was illegal encoding ... It
> | contained my ISO-2022-JP-encoded Japanese and your 8bit
> | characters (0xe6, 0xf8, 0xe5, 0xe7), though the mail header
> | insists the contents is ISO-8859-1.  Of course, ISO-2022-JP-
> | encoded JIS X 0208 characters in ISO-8859-1 encoding is
> | illegal.
> 
> No, they are not illegal, they just don't represent what you thought
> they would.  That is, ESC is a perfectly legal character which can be
> represented using ISO-8859-1.  The other characters were ASCII.

Copying ISO-2022-JP to ISO-8859-1 without converting is wrong and you
get meaningless results.  (That it comes out to be "legal" doesn't
matter.)  And you can't convert ISO-2022-JP containing kanji to
ISO-8859-1.
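
The reason is simple arithmetic: ISO-8859-1 uses one byte per character,
and its 256 byte values correspond to Unicode code points U+0000 through
U+00FF.  A sketch of that test, where fits_latin1 is a made-up name and
the code points below are added for illustration rather than taken from
the thread:

```c
/* Hypothetical helper: nonzero if the Unicode code point cp has a
 * representation in ISO-8859-1, which covers only U+0000..U+00FF. */
int fits_latin1(unsigned long cp)
{
	return cp <= 0xFFUL;
}
```

fits_latin1(0x00FB) ("û") is true, while fits_latin1(0x79C1) ("私") is
false, so a converter can only fail or substitute something else.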

> | (I imagine your 0xe6 0xf8 0xe5 0xe7 sequence in your mail is
> | intended to be ISO-8859-1, I imagined from your mail header.
> 
> Since my header shows that the body of the mail was in latin1 and I
> input those character, that is a reasonable assumption.

Since we were talking about Japanese characters, it's reasonable to
assume you entered some Japanese characters (Norwegian and French are
irrelevant) and they were misencoded.

-- 
Glenn Maynard


