Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

To: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org
Cc: Steve Langasek <vorlon@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
From: "Giacomo A. Catenazzi" <cate@debian.org>
Date: Tue, 07 Apr 2009 10:36:20 +0200
Message-id: <49DB1084.9060707@debian.org>
Reply-to: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
In-reply-to: <20090406215226.GB18298@codelibre.net>
References: <20090406120655.27815.2545.reportbug@lenny.mirbsd.org> <49DA0B6A.7060107@debian.org> <Pine.BSM.4.64L.0904061727410.28766@herc.mirbsd.org> <20090406180917.GA23092@dario.dodds.net> <20090406215226.GB18298@codelibre.net>

Roger Leigh wrote:

On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:

On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:

If you need a specific locale (as seems from "mksh", not
sure if it is a bug in that program), you need to set it.

You can only set a locale on a glibc-based system if it’s
installed beforehand, which root needs to do.

You can build-depend on the locales package and generate the locales you
want locally, using LOCPATH to reference them.  There's no need for Debian
to guarantee the presence of a particular locale ahead of time -
particularly one that isn't actually useful to end users, as C.UTF-8 would
be.


I think that it would be very useful, I'll detail why below.

The GCC toolchain has, for some time now, been using UTF-8 as the
internal representation for narrow strings (-fexec-charset).  It has
also been using UTF-8 as the default input encoding for C source code
(-finput-charset).  This means that unless you take any special
measures, your program will be outputting UTF-8 strings for all file
and terminal I/O.  Of course, this is backward compatible with ASCII,
and is also transcoded automatically when in a non-UTF-8 locale.  I've
attached a trivial example.  Just to be clear: this handling is
completely built into GCC and libc, and is completely transparent.


Hmm. Careful, you are confusing some terms:
- the input charset is the source charset (used to parse the C code);
- the exec charset is the charset of the target machine (which runs the
  program);
- C99 must support Unicode characters in identifiers (written as \uxxxx or
  in some other non-portable, implementation-defined way);
- the standard library can use locales (but only if you have initialized
  the locale), though not in all functions and not for all uses;
- wide characters are yet another thing (as you note in your example,
  the wide string is not in UTF-8; I think it is UTF-32).

Having the same input and exec charset really means: don't translate
strings (e.g. in
   if (c == 'a') printf("bcde\n");
 'a' and "bcde\n" will have the same byte values as in the input file;
 otherwise the compiler will put into the binary their representation in
 the exec charset).

I expect that your program will run fine (i.e. really no changes: the
same binary output) if you tell GCC that you use any other ASCII-derived
8-bit encoding (for both the input and the exec charset).

printf/wprintf use the locale only for numeric representation.

Usually the interpretation of bytes is done by the terminal, not by the
compiler.

Now, this will work fine in all locales *except for C/POSIX*.
Obviously the charsets of some locales can't represent all the
characters used in this example, but the C library will actually
transcode (iconv) to the locale codeset as best it can.  Except for
C/POSIX.

Now, why is this needed?

If I write a program, I might want to use non-ASCII UTF-8 characters
in the sources.  We have been doing this for years without realising
since GCC switched to UTF-8 as the default internal encoding, but
simply for portability when using the C locale we are restricted to
using ASCII only in the sources,


Really, the minimal C charset is smaller than ASCII (a portable program
must use neither "$" nor "@"; in addition, C also supports smaller
charsets, through trigraphs [preprocessor] and/or the newer digraphs
[compiler]).

and then a translation library such
as libintl/gettext to get translated strings with the extended
characters in them.  This is workable, but it imposes a big burden on
translators because I might want to use symbols and other characters
which are not part of a /language/ translation, but need adding by
each and every translator through explicit translator comments in the
sources.  This is tedious and error-prone.  If the sources were UTF-8
encoded, this would work perfectly since I could just use the
necessary UTF-8 characters directly in the source rather than abusing
the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
thus cuts out a big pile of cruft and complexity in sources which only
exists to cater for people who want to run your code in a C locale!
And the translators can completely ignore the now no longer needed
job of translating special characters as well as doing the actual
translation work, so the symbol usage is identical in all
translations, and their job is much easier.


Yes, in a perfect world we would need only one charset (and maybe only
one language and one locale). Of all the proposals for reaching this
goal, Unicode and UTF-8 seem the best solution.
But... for now, take care about locales and don't assume UTF-8,
or you will cause trouble for a lot of non-UTF-8 users.
Converting a locale (from non-UTF-8 to UTF-8) is simple for
English and a few European languages, but it is tedious work
for many users: it needs a "flag day", on which I would have to convert
all my files to UTF-8 or annotate every file with the right
encoding (most editors and tools understand such annotations).

So for now we support UTF-8, we try to make UTF-8 the default for
new users, and UTF-8 is the encoding for Debian files in packages.
But it will take many years (or maybe it will never happen) before
we can assume UTF-8 unless the user loudly tells the system to
use another encoding.


I've tested all this, and it all works *perfectly*.  Except that if
you do this, your program will not run in the C locale (and *only*
the C locale) due to having completely borked output.


It is the terminal, not the C program.

 A C.UTF-8 would
be a solution to this problem, and allow full use of the *existing*
UTF-8 string handling which all sources are built with, yet only a
tiny fraction dare to use.  Note that gettext is *completely disabled*
if used in a C locale, and this does additional mangling in addition
to the plain libc damage, resulting in *no output at all*!  (I would
need to double check that; this was the case when I last looked,
and the reason I had to abandon use of UTF-8 string literals.)


Use "en_US.UTF-8".
"C.UTF-8" is a bad name. Locale "C" means "no locale, old behaviour,
for machines". Would we need to translate all strings in C.UTF-8 too?
Which characters are alphabetic?  Which are numeric?  Which
alphabetic order applies? etc. etc.  You see: it is difficult to create
a new locale, and people must be able to understand the meaning of such
a locale (without reading the whole locale definition). "en_US.UTF-8"
is clear.

There are other uses for a UTF-8 C locale as well.  I've needed at
several times a UTF-8 locale at build time for various tasks,
mainly related to translation work.  While you mentioned it's
possible to do this by generation of locales at build time, in
practice I've found this rather error prone and unreliable.  Having
the C locale (which is the locale all our buildds use by default)
UTF-8 by default would make these jobs much easier.  Some of the
projects I work on such as gutenprint have needed to reimplement some
of the gettext internals to work around this in a portable manner.


Regarding the standards conformance of using a UTF-8 C locale:
I've spent some time reading the standards (SUSv3), and see no reason
why C can't use UTF-8 as its default codeset and still remain strictly
conforming.


UTF-8 has a lot of characters (alphabetic, numeric, whitespace).
The C locale requires that the only blank characters are SPACE and TAB.
I did look through the requirements, and I found that some of them
are incompatible with what one would expect.

So a C locale in UTF-8 would cause more trouble for users: no warning,
but whitespace is misinterpreted (note: some Windows editors
are known to insert a lot of non-standard whitespace instead of spaces).

The standard specifies a minimum requirement of a portable character
set and control character set.  This is satisfied by the 7-bit ASCII
encoding which we currently use as the C0 and G1 control and graphics
sets.  However, UTF-8 is a strict 8-bit superset of this standard, and
it is eminently reasonable to use UTF-8 *and still remain conforming*
with the minimum functionality required by the standard.  It's
explicitly spelled out in SUSv2, though the wording was dropped in
SUSv3 (definitely not forbidden, though).

POSIX/C locale:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02

Portable charset:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06

"Implementations may also add other characters."
This is from the charset documentation in SUSv2
http://opengroup.org/onlinepubs/007908775/xbd/charset.html


UTF-8 is the default character set on Debian GNU/Linux.  It's what
we all use, it's what all the tools use, and the C locale is the
last ASCII holdout.  It would make the lives of many maintainers
and users more bearable if it was also UTF-8, as well as getting
rid of the current buggy behaviour if you use UTF-8-encoded sources.
It's currently *the only blocker* preventing us using UTF-8 encoded
sources.


I think 7-bit ASCII simplifies finding bugs.
A c > 127 in a C locale is simply wrong: it will be misinterpreted
by different terminals (local, remote, etc.).
Not always: there are terminal libraries and standard libraries that
do the right thing, but with your proposal I think that within a few
months programs will simply write UTF-8 to the terminal, ignoring the
charset chosen by the user.

Before, it was: everybody must use English, because I understand English.
Now do we want: everybody must use UTF-8, because I use UTF-8?

That English is the most spoken language (and the easiest to type), or
that UTF-8 is technically very good, doesn't mean that we
should oblige users to use English or UTF-8.

ciao
	cate
