Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

To: Steve Langasek <vorlon@debian.org>, 522776@bugs.debian.org
Cc: Thorsten Glaser <tg@mirbsd.de>
Subject: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
From: Roger Leigh <rleigh@codelibre.net>
Date: Mon, 6 Apr 2009 22:52:26 +0100
Message-id: <20090406215226.GB18298@codelibre.net>
Reply-to: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org
In-reply-to: <20090406180917.GA23092@dario.dodds.net>
References: <20090406120655.27815.2545.reportbug@lenny.mirbsd.org> <49DA0B6A.7060107@debian.org> <Pine.BSM.4.64L.0904061727410.28766@herc.mirbsd.org> <20090406180917.GA23092@dario.dodds.net>

On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
> > > If you need a specific locale (as seems from "mksh", not
> > > sure if it is a bug in that program), you need to set it.
> 
> > You can only set a locale on a glibc-based system if it’s
> > installed beforehand, which root needs to do.
> 
> You can build-depend on the locales package and generate the locales you
> want locally, using LOCPATH to reference them.  There's no need for Debian
> to guarantee the presence of a particular locale ahead of time -
> particularly one that isn't actually useful to end users, as C.UTF-8 would
> be.

I think that it would be very useful; I'll detail why below.

The GCC toolchain has, for some time now, been using UTF-8 as the
internal representation for narrow strings (-fexec-charset).  It has
also been using UTF-8 as the default input encoding for C source code
(-finput-charset).  This means that unless you take any special
measures, your program will be outputting UTF-8 strings for all file
and terminal I/O.  Of course, this is backward compatible with ASCII,
and is also transcoded automatically when in a non-UTF-8 locale.  I've
attached a trivial example.  Just to be clear: this handling is
completely built into GCC and libc, and is completely transparent.
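
For readers without the attachment, here is a minimal sketch of the
narrow-string side of this (it is not the attached example).  It assumes
GCC's default -fexec-charset=UTF-8; the names sample, sample_byte_length
and sample_is_utf8 are made up for illustration.  The character is written
as a universal character name so the result does not depend on how the
source file itself is encoded.

```c
#include <stddef.h>
#include <string.h>

/* Narrow string literal containing U+00E9 (e with acute accent).
   Assumption: the compiler uses UTF-8 as the execution charset (the GCC
   default), so the character is stored as the two bytes 0xC3 0xA9. */
static const char sample[] = "caf\u00e9";

/* Number of bytes in the sample, excluding the terminating NUL. */
size_t sample_byte_length(void)
{
    return strlen(sample);
}

/* Return 1 if the sample ends with the UTF-8 encoding of U+00E9. */
int sample_is_utf8(void)
{
    const unsigned char *p = (const unsigned char *)sample;
    size_t n = strlen(sample);

    return n == 5 && p[3] == 0xC3 && p[4] == 0xA9;
}
```

Calling sample_byte_length() from a small main() should give 5 rather
than 4 under that assumption: three ASCII bytes plus two bytes for the
accented character.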

Now, this will work fine in all locales *except for C/POSIX*.
Obviously the charsets of some locales can't represent all the
characters used in this example, but the C library will actually
transcode (iconv) to the locale codeset as best it can.  Except for
C/POSIX.
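
One place where the C-locale limitation can be observed directly is the
wide-character conversion interface.  The following is only a sketch
under that assumption (it is not the attached example, and
can_encode_in_locale is a hypothetical helper name): it asks libc whether
a single wide character can be converted to the codeset of a named locale.

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return 1 if wc can be converted to the codeset of the named locale,
   0 if the conversion fails, and -1 if the locale cannot be selected.
   Note: this changes the process-wide locale and does not restore it. */
int can_encode_in_locale(const char *locale_name, wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t state;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    /* Start from the initial conversion state. */
    memset(&state, 0, sizeof state);

    /* wcrtomb() returns (size_t)-1 when wc has no representation
       in the current locale's codeset. */
    return wcrtomb(buf, wc, &state) != (size_t)-1;
}
```

With the "C" locale selected, an ASCII letter converts, while
(wchar_t)0x20AC (the euro sign) is rejected.  With a UTF-8 locale that is
installed on the system, the same call is expected to succeed.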

Now, why is this needed?

If I write a program, I might want to use non-ASCII UTF-8 characters
in the sources.  We have been doing this for years without realising
since GCC switched to UTF-8 as the default internal encoding, but
simply for portability when using the C locale we are restricted to
using ASCII only in the sources, and then a translation library such
as libintl/gettext to get translated strings with the extended
characters in them.  This is workable, but it imposes a big burden on
translators because I might want to use symbols and other characters
which are not part of a /language/ translation, but need adding by
each and every translator through explicit translator comments in the
sources.  This is tedious and error-prone.  If the sources were UTF-8
encoded, this would work perfectly since I could just use the
necessary UTF-8 characters directly in the source rather than abusing
the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
thus cuts out a big pile of cruft and complexity in sources which only
exists to cater for people who want to run your code in a C locale!
And the translators can completely ignore the now-unneeded job of
translating special characters on top of the actual translation
work, so the symbol usage is identical in all translations, and
their job is much easier.

I've tested all this, and it all works *perfectly*.  Except that if
you do this, your program will not run in the C locale (and *only*
the C locale) due to having completely borked output.  A C.UTF-8 would
be a solution to this problem, and allow full use of the *existing*
UTF-8 string handling which all sources are built with, yet only a
tiny fraction dare to use.  Note that gettext is *completely disabled*
if used in a C locale, and this does further mangling on top of
the plain libc damage, resulting in *no output at all*!  (I would
need to double check that; this was the case when I last looked,
and the reason I had to abandon use of UTF-8 string literals.)

There are other uses for a UTF-8 C locale as well.  Several times
I've needed a UTF-8 locale at build time for various tasks,
mainly related to translation work.  While you mentioned it's
possible to do this by generation of locales at build time, in
practice I've found this rather error prone and unreliable.  Having
the C locale (which is the locale all our buildds use by default)
UTF-8 by default would make these jobs much easier.  Some of the
projects I work on such as gutenprint have needed to reimplement some
of the gettext internals to work around this in a portable manner.
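
As a rough illustration of the kind of check a build-time tool ends up
needing, the sketch below reports the codeset of a named locale and
whether it is UTF-8.  It relies on POSIX nl_langinfo(CODESET); the helper
names codeset_of_locale and locale_is_utf8 are hypothetical, and the
exact codeset strings returned vary between C libraries.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Return the codeset name reported for the named locale, or NULL if the
   locale cannot be selected (for example because it is not installed).
   Note: this changes the process-wide locale and does not restore it. */
const char *codeset_of_locale(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return NULL;

    return nl_langinfo(CODESET);
}

/* Return 1 if the named locale is available and reports a UTF-8 codeset. */
int locale_is_utf8(const char *locale_name)
{
    const char *codeset = codeset_of_locale(locale_name);

    return codeset != NULL
        && (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}
```

A build script could call locale_is_utf8() on the locale it intends to
use and fall back to generating one only when the check fails.  For the
plain "C" locale the check is expected to report 0 today.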

Regarding the standards conformance of using a UTF-8 C locale:
I've spent some time reading the standards (SUSv3), and see no reason
why C can't use UTF-8 as its default codeset and still remain strictly
conforming.

The standard specifies a minimum requirement of a portable character
set and control character set.  This is satisfied by the 7-bit ASCII
encoding which we currently use as the C0 and G0 control and graphics
sets.  However, UTF-8 is a strict 8-bit superset of this standard, and
it is eminently reasonable to use UTF-8 *and still remain conforming*
with the minimum functionality required by the standard.  It's
explicitly spelled out in SUSv2, though the wording was dropped in
SUSv3 (definitely not forbidden, though).

POSIX/C locale:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02

Portable charset:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06

"Implementations may also add other characters."
This is from the charset documentation in SUSv2
http://opengroup.org/onlinepubs/007908775/xbd/charset.html

UTF-8 is the default character set on Debian GNU/Linux.  It's what
we all use, it's what all the tools use, and the C locale is the
last ASCII holdout.  It would make the lives of many maintainers
and users more bearable if it was also UTF-8, as well as getting
rid of the current buggy behaviour if you use UTF-8-encoded sources.
It's currently *the only blocker* preventing us using UTF-8 encoded
sources.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
