
Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale



Roger Leigh writes:
> On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
>> + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):

>>> Except the ton which sets LC_ALL=C to get sane (parsable,
>>> dependable, historically compatible) output.
>>
>>> These would then unset all other LC_* and LANG and LANGUAGE,
>>> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
>>> with UTF-8 (and mbrtowc and iswctype and and and) available.
>>
>> Isn't setting LC_ALL=C.UTF-8 going to be about the same and less work?
>> I'm genuinely interested if that would behave any different to what you
>> said (unsetting all, setting LC_CTYPE).
>
> % sudo localedef -c -i POSIX -f UTF-8 C.UTF-8
>
> % LANG=C.UTF8 locale charmap
> UTF-8
>
> % LANG=C locale charmap
> ANSI_X3.4-1968
>
> This appears to work correctly at first glance.
>
> However, I would ideally like the C/POSIX locales to be UTF-8
> by default as on other systems (with a C.ASCII variant if required).

By far the most critical thing is that the <wctype.h> functions
work in the normal Unicode manner, with wchar_t assumed to be
purely Unicode. This means iswupper() works, towupper() works, etc.
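
To make the requirement concrete, here is a minimal sketch of a check for it. The helper names case_pair_ok() and unicode_case_ok() are made up for illustration, and the sketch assumes wchar_t holds Unicode code points, as described above. Whether the non-ASCII pairs pass depends on the libc and on the locale selected.

```c
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if lower/upper behave as a case pair
 * under the LC_CTYPE locale currently in effect, 0 otherwise. */
int case_pair_ok(wint_t lower, wint_t upper)
{
    return towupper(lower) == upper
        && towlower(upper) == lower
        && iswupper(upper) != 0
        && iswlower(lower) != 0;
}

/* Hypothetical helper: select locale_name for LC_CTYPE and test two
 * non-ASCII case pairs. Returns -1 if the locale cannot be selected,
 * otherwise 1 if both pairs work and 0 if not. */
int unicode_case_ok(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return case_pair_ok(0x00E9, 0x00C9)    /* U+00E9 / U+00C9, e with acute */
        && case_pair_ok(0x03B1, 0x0391);   /* U+03B1 / U+0391, Greek alpha */
}
```

Calling unicode_case_ok("") tests whatever the environment selects, and unicode_case_ok("C") tests the plain "C" locale.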

This applies to locales called "", "C", and "some-unknown-junk".
The only possible exception would be when there are environment
variables set which are known to need something else. Unrecognized
locales and all other defaults have to support full Unicode.

Note that none of the above necessarily requires UTF-8, though UTF-8
seems desirable. You could use Latin-1 and still have wchar_t work.
This could all be configurable, of course. Suppose /etc/locale had:

"" UTF-8        # setlocale with "" and no environment variables
"C" Latin-1     # if the "C" locale is specifically requested
unknown UTF-8   # if we don't recognize the locale
broken UTF-8    # if parts of the locale info are missing/broken
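
Reading such a file would be simple. The following is only a sketch for the hypothetical format above: no such /etc/locale file exists today, and struct charset_rule and parse_rule() are names invented here. Each line is taken as a selector and a charset, with anything after '#' ignored.

```c
#include <stdio.h>
#include <string.h>

/* One rule from the hypothetical file: which case it covers and
 * which charset to use for it. */
struct charset_rule {
    char selector[32];   /* e.g. "\"\"", "\"C\"", "unknown", "broken" */
    char charset[32];    /* e.g. "UTF-8", "Latin-1" */
};

/* Returns 1 and fills *rule if the line holds a rule; returns 0 for
 * blank lines, comment-only lines, and lines too long to handle. */
int parse_rule(const char *line, struct charset_rule *rule)
{
    char buf[128];
    size_t len = strcspn(line, "#\n");   /* stop at comment or newline */

    if (len >= sizeof buf)
        return 0;
    memcpy(buf, line, len);
    buf[len] = '\0';

    /* Two whitespace-separated tokens; quotes stay part of the selector. */
    return sscanf(buf, "%31s %31s", rule->selector, rule->charset) == 2;
}
```

The quotes are kept in the selector so that "C" (the locale name) stays distinct from bare keywords such as unknown and broken.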

Right now, gettext doesn't even distinguish those cases. This could
be considered part of the problem. When I put a zam.mo file (Zapotec)
in the right place and set LC_ALL to "zam", I get the "C" locale!!!
Any imperfection in a locale results in "C", as ASCII as can be.
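
A program can at least notice the fallback at the setlocale() level, since setlocale() returns NULL and leaves the locale unchanged when it rejects a name. The sketch below shows only that check; try_locale() is a made-up name, it does not touch gettext's own lookup, and some C libraries accept any locale name, so a rejection is not guaranteed everywhere.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the named locale was selected,
 * 0 if setlocale() rejected it and the previous locale is still active. */
int try_locale(const char *name)
{
    const char *current;

    if (setlocale(LC_ALL, name) != NULL)
        return 1;

    /* Passing NULL only queries the locale currently in effect. */
    current = setlocale(LC_ALL, NULL);
    fprintf(stderr, "locale \"%s\" rejected; still using \"%s\"\n",
            name, current ? current : "(unknown)");
    return 0;
}
```

With this, try_locale("zam") reports the rejection on stderr instead of silently leaving the program in "C".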


