
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale



Andrew McMillan writes:
> On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:

>> So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?
...
> So given a character which is outside of the 0x00 <= 0x7f range, in an
> environment which does not specify an encoding, I would like to one day
> be able to categorically state that "Debian will by default assume that
> character is unicode, encoded according to UTF-8".

Damn right. The obscure languages of the world are numerous. Unlike
the languages of countries that were wealthy enough to participate
in native-language computing prior to UTF-8, these less-popular
languages are getting their computing support directly in UTF-8.
We mostly aren't inventing new incompatible encodings for them.

> In such an environment, with a C.UTF-8 encoding selected, when I start a
> word processing program and insert an a-umlaut in there, I would expect
> that my file will be written with a UTF-8 encoded unicode character in
> it.  I would not expect that if I sort the lines in that file, that the
> lines beginning with a-umlaut would sort before 'z'.

Right...
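
To make the sorting point concrete, here is a small sketch in Python. It
assumes that a C.UTF-8 locale collates by plain code point value, which is
what this thread expects of it; it is not taken from any locale definition.
The sample words are made up. Python compares strings by code point, and
UTF-8 preserves that order at the byte level, so both sorts agree:

```python
# Sample lines; the words themselves are arbitrary illustrations.
lines = ["zebra", "\u00e4pfel", "apple", "Zulu", "Banana"]

# Python compares str objects code point by code point.
by_code_point = sorted(lines)

# UTF-8 was designed so that byte order matches code point order.
by_utf8_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
```

Uppercase ASCII comes first, then lowercase ASCII, and the a-umlaut line
lands after 'z', exactly as it would for legacy shell scripts that rely on
"C" ordering.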

> I would not expect
> that if I grep such a file for '^[[:alpha:]]$' that my a-umlaut line
> would appear.

No. It should appear: a-umlaut is a letter in the Unicode spec.
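
A quick way to check what the Unicode spec says is to look at the general
category of a character. The sketch below uses Python's unicodedata module
rather than grep (Python's re module has no [[:alpha:]] class), so it only
illustrates the character data, not what any particular libc locale does.
The helper name is_letter is made up for this example:

```python
import unicodedata

# A few sample characters, written as escapes to avoid encoding trouble.
samples = {
    "a-umlaut": "\u00e4",
    "Cyrillic ya": "\u044f",
    "Hangul syllable han": "\ud55c",
    "digit seven": "7",
}

def is_letter(ch):
    """Hypothetical helper: true if the general category is L* (a letter)."""
    return unicodedata.category(ch).startswith("L")

for name, ch in samples.items():
    print(name, unicodedata.category(ch), is_letter(ch))
```

The first three come out as letters (categories Ll, Ll and Lo); only the
digit does not.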

> The proposal, at this stage is only that the C.UTF-8 locale is
> *installed* and *available* by default.  Not that it *be* the default,
> but that it *be there* as a default. People will naturally continue to
> be free to uninstall it, or to leave their locale to 'C'.

What if you don't set your locale to anything, or if you set it
to something that isn't recognized? You should get UTF-8 in any
of those cases.

The mechanism isn't so important. It could be that the fallback
locale used by gettext is no longer "C" (perhaps "C.UTF-8"), or it
could be that the "C" locale does UTF-8.

LC_ALL=pirate  -->  you get UTF-8, with messages from pirate.mo
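
Roughly, the behaviour I am asking for could look like the sketch below.
This is only an illustration of the policy, not how glibc or gettext work
today. The function resolve and the table of installed locales are
hypothetical; the only real thing in it is the usual precedence of LC_ALL
over the category variable over LANG:

```python
def resolve(env, installed):
    """Return (messages_locale, encoding) under the proposed fallback.

    env:       mapping of environment variables
    installed: hypothetical mapping of known locale names to encodings
    """
    def first_set(*names):
        # Standard precedence: the first non-empty variable wins.
        for name in names:
            value = env.get(name)
            if value:
                return value
        return None

    ctype = first_set("LC_ALL", "LC_CTYPE", "LANG")
    messages = first_set("LC_ALL", "LC_MESSAGES", "LANG")

    # Proposed rule: unset or unrecognized locale -> UTF-8, not ASCII.
    encoding = installed.get(ctype, "UTF-8")
    return messages, encoding


# Hypothetical set of installed locales, for illustration only.
installed = {"C.UTF-8": "UTF-8", "en_US.ISO-8859-1": "ISO-8859-1"}

print(resolve({"LC_ALL": "pirate"}, installed))
print(resolve({}, installed))
print(resolve({"LANG": "en_US.ISO-8859-1"}, installed))
```

An explicitly chosen legacy locale still gets its legacy encoding; only the
unset and unrecognized cases change.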

> Yes, I think that the C.UTF-8 locale offers something that the
> C locale doesn't.  Primarily it offers us a way out of the current
> default encodings, which are legacy encodings, without jumping boots and
> all into a world where suddenly our sort ordering is changed, and our
> users are screaming at us that en_US.UTF-8 is wrong for *them*, or that
> 'sort' is suddenly putting 'A' next to 'a' and all of their legacy shell
> scripts are broken because they expect a different behaviour.

> I believe that the list above might be the set of smallest useful
> incremental changes in this process.  I would really like to see that
> second step taken too, where the default locale is set to the most basic
> UTF-8 locale possible, but I'm happy to see a second bug and further
> discussion, if that's what we need to do to get agreement.

There are different meanings of "default".

By default, the locale should not be set in the environment.
That should give UTF-8. It could map to "C", "C.UTF-8", "(nil)",
or whatever.

>> I still think that "en_US.UTF-8" is the right default (note:
>> I'm not a US citizen, nor do I speak English).

As a US citizen who does speak English, I guess I'm an authority
on the en_US.UTF-8 locale. It is offensively defective. It sorts
stuff in a crazy order designed by some moronic committee.
I doubt it even accepts Cyrillic and Korean as having letters.


