
Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale



On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:
> 
> So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?
> 
> It is not a stupid question, and the answer is not the UTF-8 algorithm
> to code/decode unicode.
> I'm still thinking that you are confusing the various meanings.
> And until I understand the problem, I cannot propose a solution.

While it is true that the C locale is (already) a UTF-8 compatible
locale, it provides no clues to the system for the encoding of
characters outside that locale.

We can all be purist about the C locale and believe that all characters
fit in 7 bits, but we all know that reality is not like that.  It's not
like that even in the northern part of the continent pair that 'ASCII'
gets its name from.

I believe that Debian should endorse Unicode as the preferred method for
mapping between numbers and characters.  I do not expect there is any
real argument against this, although I do understand that current
versions of Unicode may not yet comprehensively/satisfactorily represent
all glyphs in some languages.  I think there is hope that these problems
will eventually be ironed out.

There are, of course, a number of systems for encoding Unicode
characters, but I do not seriously expect that anyone is recommending
that Debian should use UTF-16, UTF-32 (or, $DEITY forbid, Punycode :-)
as something which should be available everywhere.

So given a character outside the 0x00-0x7f range, in an environment
which does not specify an encoding, I would like one day to be able to
state categorically that "Debian will by default assume that character
is Unicode, encoded according to UTF-8".
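To sketch what that default assumption means in practice (using Python
purely as an illustration), the same bytes above 0x7f decode to quite
different characters depending on which encoding the system assumes:

```python
# UTF-8 encodes a-umlaut (U+00E4) as two bytes:
data = "\u00e4".encode("utf-8")
print(data)                                   # b'\xc3\xa4'

# Decoded under a UTF-8 assumption, we get the one character back:
print(data.decode("utf-8"))                   # 'ä'

# Decoded under a legacy ISO 8859-1 assumption, the same two bytes
# become two unrelated characters ('Ã' then '¤') -- mojibake:
print(data.decode("iso-8859-1"))              # 'Ã¤'
```

The point is not the library calls, but that *some* default has to be
assumed, and today that default is a legacy one.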

In such an environment, with a C.UTF-8 locale selected, when I start a
word processing program and insert an a-umlaut, I would expect my file
to be written with a UTF-8-encoded Unicode character in it.  I would
not expect that, if I sorted the lines in that file, the lines
beginning with a-umlaut would sort before 'z'.  Nor would I expect
that, if I grepped such a file for '^[[:alpha:]]$', my a-umlaut line
would appear.
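A minimal model of that sorting expectation (hypothetical, since
C.UTF-8 does not yet exist in glibc at the time of writing): Python's
default string comparison is codepoint order, which matches the plain
byte ordering a C.UTF-8 locale would be expected to keep:

```python
lines = ["zoo", "\u00e4pfel", "apple"]   # "äpfel" begins with a-umlaut

# Codepoint/byte order: 'a' (0x61) < 'z' (0x7a) < 'ä' (0xe4), so the
# a-umlaut line sorts *after* 'z' rather than in among the 'a's:
print(sorted(lines))
# -> ['apple', 'zoo', 'äpfel']
```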

At present I don't believe that this happens.  At present we continue
to perpetuate encodings such as ISO 8859-1 in these situations,
creating pain for our children and grandchildren to resolve.


So as a first step in this process of 'cleaning up our world', this bug
is proposing a smaller change than that, and a smaller change than I
believe you think it is.


The proposal, at this stage, is only that the C.UTF-8 locale be
*installed* and *available* by default.  Not that it *be* the default,
but that it *be there* as a default.  People will naturally remain free
to uninstall it, or to leave their locale set to 'C'.
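As a concrete (hypothetical) sketch of what mere availability buys a
package, a build or test script could probe for the locale and fall
back gracefully when it is absent -- locale names and availability
vary by system, so this is purely illustrative:

```shell
# Probe for C.UTF-8 and fall back to plain C if it is not installed.
if locale -a 2>/dev/null | grep -qix 'C\.UTF-8'; then
    LC_ALL=C.UTF-8
else
    LC_ALL=C
fi
export LC_ALL
echo "using locale: $LC_ALL"
```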


Once this minimum step is made, and we've all calmed down, we can think
further on radical and dramatic changes over coming years where more
significant shifts are made, like:

* The default locale at installation is C.UTF-8 rather than C.
* The default locale at installation is assigned based on the
installation language.
* If a locale is set which doesn't specify an encoding, the system
defaults to assuming UTF-8.
* All ISO8859 locales are moved to a new locales-legacy-encodings
package.
* ... and so on.


Yes, I think that the C.UTF-8 locale offers something that the C locale
doesn't.  Primarily it offers us a way out of the current legacy
default encodings, without jumping boots-and-all into a world where
suddenly our sort ordering has changed, our users are screaming at us
that en_US.UTF-8 is wrong for *them*, and 'sort' is suddenly putting
'A' next to 'a', breaking all of their legacy shell scripts that expect
a different behaviour.
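To illustrate the collation concern (using Python purely as a model;
str.casefold() is only a crude stand-in for en_US dictionary collation,
not glibc's actual tables):

```python
names = ["apple", "Zebra", "Apple", "zebra"]

# C-locale-style ordering is plain codepoint order: every uppercase
# letter sorts before every lowercase one.
print(sorted(names))
# -> ['Apple', 'Zebra', 'apple', 'zebra']

# A dictionary-style collation, as en_US.UTF-8 approximates, interleaves
# the cases, putting 'A' next to 'a' -- exactly the change that
# surprises legacy scripts:
print(sorted(names, key=str.casefold))
# -> ['apple', 'Apple', 'Zebra', 'zebra']
```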


I believe that the list above might be the set of smallest useful
incremental changes in this process.  I would really like to see that
second step taken too, where the default locale is set to the most basic
UTF-8 locale possible, but I'm happy to see a second bug and further
discussion, if that's what we need to do to get agreement.


> - terminals should be sensible to charsets, on choosing how to display
>    things
> - programs should be sensible to locales (topic of this discussion):
>    the locales provides some charsets dependent strings, and interpretation
>    of the various characters, but (usually) they MUST NOT translate characters.

Not so.  They also have to consider how to handle input, unless by
'terminal' you mean any program which might handle character input and
output...

An example I ran into in the last week: some software processing
information from the internet was converting &nbsp; into the single
byte 0xa0.  While I have now stopped using that particular software
(Html::Strip, if anyone's interested), it illustrates exactly how
software currently doesn't know the encoding, and through not knowing
it can perpetuate encoding systems which need to die.
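The problem with that 0xa0 byte, sketched in Python: it is the correct
ISO 8859-1 encoding of the non-breaking space, but on its own it is not
even well-formed UTF-8:

```python
nbsp = "\u00a0"                    # &nbsp; is U+00A0, non-breaking space

print(nbsp.encode("iso-8859-1"))   # b'\xa0'      -- the legacy byte
print(nbsp.encode("utf-8"))        # b'\xc2\xa0'  -- the UTF-8 form

# A lone 0xa0 byte is a continuation byte with no lead byte, so a
# UTF-8 decoder must reject it:
try:
    b"\xa0".decode("utf-8")
except UnicodeDecodeError:
    print("lone 0xa0 is invalid UTF-8")
```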


> Anyway:
> 
> The locale C is already a UTF-8 compatible locale.
> No? so what it misses?
> - other alphabetic, numeric, currency, whitespace characters?  But not UTF-8
>    local provides all characters: they define only the needed range for the
>    language [see wikipedia, which should code UTF-8 as binary for this reason].
>    The "C" "spoken" language require only ASCII-7 (or maybe only a subrange of it).
>    So why we need further characters?
>    Note: whitespace are restricted in "C" locale by POSIX, in only two values
> 
>    We could use charset UTF-8 for C locale, declaring unused/illegal all
>    c > 127.  Whould this solve the problems with mksh? I don't think so,
>    so what you need in this C.UTF-8?
> 
> I still think that "en_US.UTF-8" is the right default (note:
> I'm not a US citizen, nor I speak English).

Note that this proposal is not that we change the default sort ordering
or character typing, which en_US *would* do (vs C).

This proposal (if it were that strong) would be pushing for adoption of
UTF-8 encoding as the default encoding.  It isn't as strong as that,
though.  It is merely pushing for the *availability* of a UTF-8 locale
on a default install.


> The installation will install the correct locale, so the en_US period is very
> short (we'll dominate them ;-) ).
> 
> On debootstrap/pbuild/... things are different.  But if it this the problem,
> let check a solution for building environment (and I still think that in this
> env "en_US.UTF-8" could be nice.
> 
> But I'll prefer a simple basic ASCII-7 "C" for basic/plain build, and only
> after packager thinks if it is a bug or a feature to have a specific build with
> UTF-8, it should manually set it.
> Why build need to depend to a locale?
> UNIX way is to allow to compile things for remote (maybe other OS, other arch)
> system.
> For testing? So why not test various locales (UTF-8, but also other non
> ascii based encodings)

What environments people build or test in is a separate issue from what
environments are available to them to build or test in, and indeed Steve
Langasek has already suggested a seemingly reasonable workaround for the
immediate problem which was, funnily enough, that mksh wants to have a
UTF-8 locale *available* in order for it to *test the build*...

So we could close this bug as 'why bother', really, but the discussion
is much more important than that.

Regards,
					Andrew McMillan.

------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com                            +64(272)DEBIAN
              Does the turtle move for you?  www.kame.net
------------------------------------------------------------------------




