
Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale



Andrew McMillan wrote:
On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:
So I have a question: what does UTF-8 mean in this context (C.UTF-8)?

It is not a stupid question, and the answer is not the UTF-8 algorithm
to encode/decode Unicode.
I'm still thinking that you are confusing the various meanings.
And until I understand the problem, I cannot propose a solution.

While it is true that the C locale is (already) a UTF-8-compatible
locale, it provides no clues to the system about the encoding of
characters outside its repertoire.

We can all be purist about the C locale and believe that all characters
have 7 bits, but we all know that reality is not like that.  It's not
like that even in the northern part of the continent pair that 'ASCII'
gets its name from.

I believe that Debian should endorse Unicode as the preferred method for
mapping between numbers and characters.  I do not expect there is any
real argument against this, although I do understand that current
versions of Unicode may not yet comprehensively/satisfactorily represent
all glyphs in some languages.  I think there is hope that these problems
will eventually be ironed out.

There are, of course, a number of systems for encoding Unicode
characters, but I do not seriously expect that anyone is recommending
that Debian should use UTF-16, UTF-32 (or, $DEITY forbid, Punycode :-)
as something which should be available everywhere.
So given a character which is outside of the 0x00-0x7f range, in an
environment which does not specify an encoding, I would like to one day
be able to categorically state that "Debian will by default assume that
character is Unicode, encoded according to UTF-8".

I agree, except for the last sentence.
"Debian will use Unicode, encoded according to UTF-8, as the default",
but not *assume* it.  It is again a portability issue: let (old)
programs also work on future Debian.

Note that the problem with 7-bit ASCII also arises with other encodings.
We are Europeans or Americans, so UTF-8 seems an easy transition,
but for people who use other, non-ASCII-based encodings this could be
very hard.  If we start assuming UTF-8 we cause a lot of trouble on
other continents.  Files which were readable in Lenny would in future be
readable only with a command-line utility; what a nightmare for our
users!


So while your first paragraph is a nice objective, we should not
add "assumptions" that cause more trouble.
I think the opposite direction is best: let us assume
less about the locale, and let the user and the system find and choose
the right encodings.
I.e. let me read a file with "less" in many encodings
(heuristics, magic strings, or a command-line argument), instead of
building "less" to assume UTF-8.


We have the same objective, but two different ways of getting there.
And because I have used and still use a lot of different systems, I
think my way is the best.


In such an environment, with a C.UTF-8 encoding selected, when I start a
word processing program and insert an a-umlaut in there, I would expect
that my file will be written with a UTF-8 encoded unicode character in
it.  I would not expect that, if I sort the lines in that file, the
lines beginning with a-umlaut would sort before 'z'.  I would not
expect that, if I grep such a file for '^[[:alpha:]]$', my a-umlaut
line would appear.
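
Such expectations can be tested directly rather than argued about; a
small hypothetical probe (assuming the C library knows the locale you
name on the command line):

    /* Probe how a locale classifies and collates a-umlaut (U+00E4).
     * Hypothetical example; pass the locale to test as argv[1]. */
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(int argc, char **argv)
    {
        const char *loc = argc > 1 ? argv[1] : "C.UTF-8";
        if (!setlocale(LC_ALL, loc)) {
            fprintf(stderr, "locale %s is not available\n", loc);
            return 1;
        }
        /* Is a-umlaut an alphabetic character in this locale? */
        printf("iswalpha(U+00E4) = %d\n", iswalpha((wint_t)0x00E4));
        /* Negative result: "ä" collates before "z"; positive: after. */
        printf("wcscoll(\"\\u00E4\", \"z\") = %d\n",
               wcscoll(L"\u00E4", L"z"));
        return 0;
    }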

I think nobody should use "C" or "C.UTF-8" as their user locale.
And I really hope that Debian will try to convince users to use a
proper locale.


At present I don't believe that this does happen.  At present we
continue to perpetuate encodings such as ISO 8859-1 in these situations,
creating pain for our children and grandchildren to resolve.

No, I think Debian is really pushing UTF-8, and fortunately we can
distinguish ISO 8859-1 from UTF-8 automatically (except for a few
"degenerate" cases). This could help.  But the world is not only
ASCII-based, so mandating UTF-8 will cause more trouble.

I think we can use more heuristics to find the right encoding,
and encourage programmers to annotate files with the right
encoding (you see more and more files which explicitly tell
the editor about the encoding).
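
Such annotations already have established forms; for instance
(illustrative only), the Emacs modeline convention, shown here in a C
source file:

    /* -*- coding: utf-8 -*- */
    /* The line above is the Emacs "coding" modeline convention.
     * Python standardised the same pattern in PEP 263, and XML
     * carries its encoding in the document itself:
     * <?xml version="1.0" encoding="UTF-8"?> */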

So as a first step in this process of 'cleaning up our world', this bug
is proposing a smaller change than that, and a smaller change than I
believe you think it is.

It helps you, it helps Europeans and Americans, but it doesn't help
with writing programs that the whole world can use (also to read older
documents).

Setting a real locale (not "POSIX" or "C") solves this, and BTW that is
what Debian is doing.
C.UTF-8 would create a new locale, not remove one, so it is not going
in the right direction.


The proposal, at this stage, is only that the C.UTF-8 locale be
*installed* and *available* by default.  Not that it *be* the default,
but that it *be there* as a default. People will naturally continue to
be free to uninstall it, or to leave their locale set to 'C'.
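
Making the locale merely *available* is cheap to rely on: a program can
detect at run time whether it is installed, as in this minimal sketch,
since setlocale() returns NULL when the requested locale is missing:

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        if (setlocale(LC_ALL, "C.UTF-8"))
            puts("C.UTF-8 is available");
        else
            puts("C.UTF-8 is not installed; falling back to C");
        return 0;
    }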


Once this minimum step is made, and we've all calmed down, we can think
further about radical and dramatic changes over the coming years, where
more significant shifts are made, like:

* The default locale at installation is C.UTF-8 rather than C.

BTW, the default is not C.  The real default is en_US.UTF-8 (if you
press Enter continuously at installation time), so it is already a
UTF-8 encoding. We could hide the non-UTF-8 encodings further
(but it seems that in Lenny the other encodings are already
hidden, at least for "European" languages).

* The default locale at installation is assigned based on the
installation language.

Already in Lenny.

* If a locale is set which doesn't specify an encoding, the system
defaults to assuming UTF-8.

ok. "C" is not the default in POSIX. Systems can choose any locale
But only in few case we need it. Locales are normally set.
So let look at the different cases when we have no locale,
and see why, and the best solution (debootstrap, ssh on
some remote machine (ok outside debian), ...)
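
For the cases where a locale *is* set, a program does not have to
assume anything: the implied encoding can be queried.  A minimal
sketch:

    /* Ask the current locale which codeset it implies, instead of
     * assuming one.  Under glibc the C/POSIX locale reports
     * "ANSI_X3.4-1968" (i.e. ASCII); *.UTF-8 locales report "UTF-8". */
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        setlocale(LC_ALL, "");   /* honour LANG/LC_* from the environment */
        printf("codeset: %s\n", nl_langinfo(CODESET));
        return 0;
    }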

* All ISO8859 locales are moved to a new locales-legacy-encodings
package.

These encodings are also used on CDs, floppies, remote filesystems,
USB pens, on a lot of internet pages, etc.

So we can discourage them in new content, but we must be able to read
the current and the old world!
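
Reading the old world is a solved problem at the library level; a
sketch using the standard iconv(3) interface:

    /* Convert ISO 8859-1 data to UTF-8 with iconv(3). */
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "caf\xE9";             /* "café" in ISO 8859-1 */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out;

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            return 1;
        }
        iconv_close(cd);
        fwrite(out, 1, sizeof out - outleft, stdout);  /* UTF-8 "café" */
        putchar('\n');
        return 0;
    }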


* ... and so on.


Yes, I think that the C.UTF-8 locale offers something that the
C locale doesn't.  Primarily it offers us a way out of the current
default encodings, which are legacy encodings, without jumping boots
and all into a world where suddenly our sort ordering is changed, our
users are screaming at us that en_US.UTF-8 is wrong for *them*, or
'sort' is suddenly putting 'A' next to 'a' and all of their legacy
shell scripts are broken because they expect a different behaviour.

But a 7-bit ASCII "C" locale allows you to do the same things. It
doesn't forbid 8-bit characters (and thus UTF-8). Unix is transparent
about characters (i.e. binary and text are the same; you can grep
binaries, ...).

So scripts should use LANG=C in most cases.
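
That transparency is easy to picture: a classic Unix filter treats its
input as opaque bytes, so UTF-8 flows through it untouched even in the
C locale.  A minimal sketch:

    /* A filter that never interprets characters: it finds structure
     * in the newline byte and passes everything else through, so
     * UTF-8 (or any 8-bit data) survives unharmed. */
    #include <stdio.h>

    int main(void)
    {
        int c;
        long lines = 0;
        while ((c = getchar()) != EOF) {
            if (c == '\n')
                lines++;                   /* structure comes from bytes, */
            putchar(c);                    /* content is passed through   */
        }
        fprintf(stderr, "%ld lines\n", lines);
        return 0;
    }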

If you have trouble seeing characters, it is because of the terminal.
In this case we can force the terminal to use UTF-8 with the "C" locale
(as an option), or you should use a real locale.  In that case
you are the user, so you can choose the right locale.


There are problems with binary code, when the compiler runs in a
different locale and the code was not so "portable". But this is a
different problem, which requires a different solution (possibly at
build time).
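
For example (an illustrative sketch, not a policy): GCC lets the build
system state the source and execution character sets explicitly with
-finput-charset and -fexec-charset, independent of the build host's
locale, and spelling non-ASCII literals as explicit bytes avoids the
problem entirely:

    /* Build with the assumptions made explicit, e.g.:
     *
     *     gcc -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 hello.c
     */
    #include <stdio.h>

    int main(void)
    {
        printf("na\xC3\xAFve\n");   /* "naïve" spelled out as UTF-8 bytes
                                       stays portable in any locale */
        return 0;
    }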


I believe that the list above might be the set of smallest useful
incremental changes in this process.  I would really like to see that
second step taken too, where the default locale is set to the most basic
UTF-8 locale possible, but I'm happy to see a second bug and further
discussion, if that's what we need to do to get agreement.

Already in Lenny.



- terminals should be sensitive to charsets when choosing how to display
   things
- programs should be sensitive to locales (the topic of this discussion):
   the locales provide some charset-dependent strings and the interpretation
   of the various characters (see the sketch after this list), but (usually)
   they MUST NOT translate characters.
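
A sketch of those charset-dependent strings, queried from the locale
rather than hard-coded:

    /* Localised names and formats the locale provides to programs. */
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        printf("first month : %s\n", nl_langinfo(MON_1));    /* e.g. "January" */
        printf("date format : %s\n", nl_langinfo(D_FMT));    /* e.g. "%m/%d/%y" */
        printf("yes pattern : %s\n", nl_langinfo(YESEXPR));  /* e.g. "^[yY]" */
        return 0;
    }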

Not so.  They have to consider how to handle input also, unless by
'terminal' you mean any program which might handle character input and
output...

An example I had in the last week was that some software processing
information from the internet was converting &nbsp; into the single
byte 0xa0.  While I have now stopped using that particular software
(Html::Strip, if anyone's interested), it illustrates exactly how
software currently doesn't know, and through not knowing it can
perpetuate encoding systems which need to die.

No, I mean true terminals. Programs should usually be transparent to
the encoding (when used as filters, etc.).  "sed" would not have
such a problem.  Hmm, but 0xa0 should be specified by number.
If the c=0xa0 was in the source, OK, I see the problem, but most
languages permit an encoding annotation at the top of the source.
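
For the record, the correct UTF-8 spelling of &nbsp; is two bytes; a
minimal sketch (the helper name is made up for illustration):

    /* U+00A0 NO-BREAK SPACE is 0xC2 0xA0 in UTF-8; emitting the
     * single byte 0xA0 silently produces ISO 8859-1 output, which
     * is exactly how legacy encodings get perpetuated. */
    #include <stdio.h>

    /* Hypothetical helper: expand "&nbsp;" into UTF-8 output. */
    static void emit_nbsp_utf8(FILE *out)
    {
        fputs("\xC2\xA0", out);            /* not "\xA0" */
    }

    int main(void)
    {
        printf("word");
        emit_nbsp_utf8(stdout);
        printf("word\n");
        return 0;
    }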

ciao
	cate


