Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Cc: Steve Langasek <vorlon@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
From: Roger Leigh <rleigh@codelibre.net>
Date: Tue, 7 Apr 2009 22:33:24 +0100
Message-id: <20090407213324.GD12845@codelibre.net>
Reply-to: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org
In-reply-to: <49DB1084.9060707@debian.org>
References: <20090406120655.27815.2545.reportbug@lenny.mirbsd.org> <49DA0B6A.7060107@debian.org> <Pine.BSM.4.64L.0904061727410.28766@herc.mirbsd.org> <20090406180917.GA23092@dario.dodds.net> <20090406215226.GB18298@codelibre.net> <49DB1084.9060707@debian.org>

On Tue, Apr 07, 2009 at 10:36:20AM +0200, Giacomo A. Catenazzi wrote:

I can't help but feel that your reply completely missed the
purpose of what I want to do, and why.  I hope the following
response clears things up.

> Roger Leigh wrote:
>> On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
>>> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
>>>>> If you need a specific locale (as seems from "mksh", not
>>>>> sure if it is a bug in that program), you need to set it.
>>>> You can only set a locale on a glibc-based system if it’s
>>>> installed beforehand, which root needs to do.
>>> You can build-depend on the locales package and generate the locales you
>>> want locally, using LOCPATH to reference them.  There's no need for Debian
>>> to guarantee the presence of a particular locale ahead of time -
>>> particularly one that isn't actually useful to end users, as C.UTF-8 would
>>> be.
>>
>> I think that it would be very useful, I'll detail why below.
>>
>> The GCC toolchain has, for some time now, been using UTF-8 as the
>> internal representation for narrow strings (-fexec-charset).  It has
>> also been using UTF-8 as the default input encoding for C source code
>> (-finput-charset).  This means that unless you take any special
>> measures, your program will be outputting UTF-8 strings for all file
>> and terminal I/O.  Of course, this is backward compatible with ASCII,
>> and is also transcoded automatically when in a non-UTF-8 locale.  I've
>> attached a trivial example.  Just to be clear: this handling is
>> completely built into GCC and libc, and is completely transparent.
>
> Hmm. Warning, you confuse some terms.

I'm not really sure how relevant these minor points are to the general
point that I was trying to make.

> - input charset is the source charset (used to parse C code)
> - exec charset is the charset of the target machine (which run the program).

That's pretty much what I said.

> - C99 must support unicode identifier (written with \uxxxx or in other
>   non portable implementation defined way)

OK.  But that's really nothing to do with the fact that you can use
UTF-8 sources directly.  It's akin to having to support trigraphs,
but we don't use trigraphs because they are bloody annoying and nowadays
completely unnecessary.  But mainly, it doesn't affect the exec charset
whether you use UTF-8 encoded sources or \uxxxx.
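
A minimal sketch of that last point, assuming the file is saved as UTF-8
and compiled with GCC's default execution charset (the function name is
hypothetical, chosen only for illustration).  It spells U+00E9 once as a
raw UTF-8 character and once as a universal character name, then
compares the resulting narrow strings:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: compares two spellings of U+00E9 (e with acute).
 * Assumption: this file is saved as UTF-8 and built with GCC's default
 * -finput-charset / -fexec-charset settings. */
bool same_exec_bytes(void)
{
    const char *direct  = "é";       /* raw UTF-8 bytes in the source */
    const char *escaped = "\u00e9";  /* universal character name */

    /* Both literals end up in the execution charset, so they match. */
    return strcmp(direct, escaped) == 0;
}
```

Under those assumptions both literals hold the two bytes 0xC3 0xA9
followed by the terminating NUL.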

> - standard libraries can use locales (but only if you initialized the locale),
>   but not all the functions, not all uses.
> - wide charaters are yet an other things (as you note in your example,
>   the wide string is not in UTF-8, but I think UTF-32)
>
> Same input and exec charset really means: don't translate strings
> (e.g. in
>    if(c == 'a') printf("bcde\n");
>  'a' and "bcde\n" will have the same values as in the input file, else
>  it will put in binary the representation of exec charset)

Of course.  However, the test program I posted showed that if the
locale has been appropriately initialised, there is an additional
translation between the exec charset and the output charset specified
by the locale (see the Latin characters correctly preserved and output
as ISO-8859-1 in an ISO-8859-1 locale).

> I expect that your program will run fine (i.e. really no changes: the
> same binary output), if you use tell GCC that you use any other ASCII-7
> derived 8-bit encoding (both for input and exec charset).

Of course.

> Usually the interpretation of bytes is done by terminal, not by compiler.

It's done at several points:
compiler: source->exec
runtime: locale-dependent exec->output (and optional use of gettext)
terminal: output->display
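
The runtime stage can be sketched as follows.  This is not the test
program from the earlier mail; it is a small illustration that assumes
a wide string as input, because that is the case where the C library's
locale-dependent conversion is easy to show.  The helper name is
hypothetical.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: encode a wide string into the multibyte encoding
 * of the named locale.  Returns the number of bytes written (excluding
 * the terminating NUL), or (size_t)-1 if the locale is not available or
 * a character cannot be represented in its codeset.
 * Note: this changes the process-wide LC_CTYPE setting, and the output
 * is NUL-terminated only if it fits in out_size bytes. */
size_t encode_for_locale(const char *locale_name, const wchar_t *ws,
                         char *out, size_t out_size)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;          /* locale not installed */
    return wcstombs(out, ws, out_size);
}
```

On a system where wchar_t holds Unicode code points (as with glibc),
passing L"\u00e9" with an installed ISO-8859-1 locale would be expected
to give the single byte 0xE9, and with a UTF-8 locale the two bytes
0xC3 0xA9.  Which locale names are installed is an assumption about the
system.  In the C locale as discussed here, the same call is expected
to fail for non-ASCII input.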

>> Now, this will work fine in all locales *except for C/POSIX*.
>> Obviously the charsets of some locales can't represent all the
>> characters used in this example, but the C library will actually
>> transcode (iconv) to the locale codeset as best it can.  Except for
>> C/POSIX.
>>
>> Now, why is this needed?
>>
>> If I write a program, I might want to use non-ASCII UTF-8 characters
>> in the sources.  We have been doing this for years without realising
>> since GCC switched to UTF-8 as the default internal encoding, but
>> simply for portability when using the C locale we are restricted to
>> using ASCII only in the sources,
>
> Really minimal C charset is smaller than ASCII (a portable program
> must not have "$" and no "@", plus C supports also smaller charset,
> with trigraps [preprocessor] and/or new bigraphs [compiler])

I'm not sure how relevant this is.  This is specified as the minimum
requirement by the *C standard*.  But, it's the *minimum* requirement.
GCC supports full use of UTF-8 (or whatever) encoded sources, and I
want to make better use of it, while still remaining in compliance
with the standard (which it is--I've read the ISO C standard relating
to source and execution character sets, and you're allowed to do better
than 7 bit ASCII!).

>> and then a translation library such
>> as libintl/gettext to get translated strings with the extended
>> characters in them.  This is workable, but it imposes a big burden on
>> translators because I might want to use symbols and other characters
>> which are not part of a /language/ translation, but need adding by
>> each and every translator through explicit translator comments in the
>> sources.  This is tedious and error-prone.  If the sources were UTF-8
>> encoded, this would work perfectly since I could just use the
>> necessary UTF-8 characters directly in the source rather than abusing
>> the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
>> thus cuts out a big pile of cruft and complexity in sources which only
>> exists to cater for people who want to run your code in a C locale!
>> And the translators can completely ignore the now no longer needed
>> job of translating special characters as well doing as the actual
>> translation work, so the symbol usage is identical in all
>> translations, and their job is much easier.
>
> yes, in a perfect world we need only one charset (and maybe only
> one language and one locale). From all the proposals to reach this
> target, unicode and UTF-8 seems the best solution.
> But... for now take care about locales and don't assume UTF-8,
> or you will cause trouble with a lot of non-UTF-8 users.
> Converting locale (from non-UTF-8 to UTF-8) is simple for
> English and few European languages, but it is a tedious work
> for many user: it need a "flag day", in which I should convert
> all my files to UTF-8 or annotate every file with the right
> encoding (most of editors and tools understands such annotations).

I have never *ever* suggested that we only use one charset.  I'm only
suggesting that the *C locale* must be UTF-8 in order to allow for
full UTF-8 support.  Normal user locales can use whatever charset
they like.

Non-UTF-8 users won't be disadvantaged because the UTF-8 exec charset
will be recoded to their locale-specific output codeset, either by
libc or gettext.

The C locale is special in that normal users won't use it, but
system programs and code needing locale independence do use it.
Any program wanting to work correctly in a C locale must only use
ASCII or it *breaks*.  This means we are /de facto/ restricted to
ASCII unless we take special effort to work around the fact (and
this was the point of my l10n/i18n comments above).

Most programs do need to work correctly in a C locale, and so can't
use UTF-8 either as a source or exec charset.  This is a severe
limitation.

> So for now we support UTF-8, we try to set UTF-8 default to
> new users, and UTF-8 is the encoding for debian files in packages.
> But it will take a lot of years (or maybe never) before
> we can assume UTF-8 if user don't loudly tell the system to
> use other encodings.

We're at that point now, but this really is not relevant to the
purpose of this discussion.

>> I've tested all this, and it all works *perfectly*.  Except that if
>> you do this, your program will not run in the C locale (and *only*
>> the C locale) due to having completely borked output.
>
> It is the terminal, not the C program.

No, it is the program.  I have tested this with different terminal
input encodings and by examining the program output byte-by-byte
(as my test program shows).

>>  A C.UTF-8 would
>> be a solution to this problem, and allow full use of the *existing*
>> UTF-8 string handling which all sources are built with, yet only a
>> tiny fraction dare to use.  Note that gettext is *completely disabled*
>> if used in a C locale, and this does additional mangling in addition
>> to the plain libc damage, resulting in *no output at all*!  (I would
>> need to double check that; this was the case when I last looked,
>> and the reason I had to abandon use of UTF-8 string literals.)
>
> Use "en_US.UTF-8".

Why?  Did you actually understand the rationale I provided above?
I could use en_US.UTF-8, or any locale.  But the point is that the
code works in all locales *except* C.

> "C.UTF-8" is a bad name. Locale "C" means "no locale, old behaviour,
> for machine".

No.  "C" and "POSIX" mean the /default/ POSIX-specified locale.  And
there's nothing written in the standard that restricts that locale
to 7-bit ASCII as its codeset.  There are UNIX systems out there
right now using UTF-8 in their C locale.

> Do we need to translate all strings also on C.UTF-8?

Of course not.  We don't do any translation in the C locale.  The
only difference is the character encoding, which is backward
compatible with ASCII in any case.

> Which alphabetic characters?  Which numeric characters?  Which
> alphabetic order? etc. etc.  You see: it is difficult to create
> a new locale, and people must understand the meaning of such locale
> (without reading all the locale definition).

For a minimal locale it could just use strict numerical ordering.
It should probably copy what existing systems using UTF-8 C locales do.
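
A sketch of what "strict numerical ordering" could look like, assuming
the phrase is read as ordering by code point.  For valid UTF-8, comparing
the bytes as unsigned values gives the same order as comparing code
points, and strcmp is specified to compare as unsigned char.  The
function name is hypothetical.

```c
#include <string.h>

/* Hypothetical comparator: orders UTF-8 strings by byte value, which
 * for valid UTF-8 is the same as ordering by Unicode code point.
 * Returns <0, 0 or >0 like strcmp. */
int codepoint_order(const char *a, const char *b)
{
    return strcmp(a, b);
}
```

For example, "z" (U+007A) sorts before U+00E9, which in turn sorts
before U+20AC, with no collation tables involved.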

>> Regarding the standards conformance of using a UTF-8 C locale:
>> I've spent some time reading the standards (SUSv3), and see no reason
>> why C can't use UTF-8 as its default codeset and still remain strictly
>> conforming.
>
> UTF-8 as a lot of characters (alphabetic, numeric, white).
> C locale requires that whitespace are only SPACE and TAB.

Where is this requirement?  Can you point me to the SUSv3 definition?

> I did look for all requirement, but I found that some requirement
> are incompatible from what one should expect.

Again, do you have references or examples?

> So a C local in UTF-8 would cause more trouble to users (no warning,
> but the whitespace are missinterpreted (note: some windows editors
> are know to insert a lot of non standard whitespace, instead of spaces).

Huh?

>> The standards specifies a minimum requirement of a portable character
>> set and control character set.  This is satisfied by the 7-bit ASCII
>> encoding which we currently use as the C0 and G0 control and graphics
>> sets.  However, UTF-8 is a strict 8-bit superset of this standard, and
>> it is eminently reasonable to use UTF-8 *and still remain conforming*
>> with the minimum functionality required by the standard.  It's
>> explicitly spelled out in SUSv2, though the wording was dropped in
>> SUSv3 (definitely not forbidden, though).
>>
>> POSIX/C locale:
>> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02
>>
>> Portable charset:
>> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06
>>
>> "Implementations may also add other characters."
>> This is from the charset documentation in SUSv2
>> http://opengroup.org/onlinepubs/007908775/xbd/charset.html
>>
>>
>> UTF-8 is the default character set on Debian GNU/Linux.  It's what
>> we all use, it's what all the tools use, and the C locale is the
>> last ASCII holdout.  It would make the lives of many maintainers
>> and users more bearable if it was also UTF-8, as well as getting
>> rid of the current buggy behaviour if you use UTF-8-encoded sources.
>> It's currently *the only blocker* preventing us using UTF-8 encoded
>> sources.

Just to clarify what I meant here.  We can currently support any
locale using charset recoding via iconv, or by abusing gettext
(which does recoding as a side effect of its main purpose of
translating text).  This works for all locales except C, where
it doesn't do any translation.
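
For illustration, the iconv recoding mentioned above might look like
this.  It is a minimal sketch using the POSIX iconv interface; the
helper name and the choice of target charset in the usage note are
assumptions, not anything taken from an existing program.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: recode a UTF-8 byte string into another charset,
 * e.g. "ISO-8859-1".  Returns the number of bytes written to out, or
 * (size_t)-1 if the conversion is unsupported, a character cannot be
 * represented, or the output buffer is too small. */
size_t recode_from_utf8(const char *to_charset, const char *in, size_t in_len,
                        char *out, size_t out_size)
{
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;            /* conversion not supported */

    char *src = (char *)in;           /* iconv takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;            /* conversion failed */
    return out_size - dst_left;
}
```

Recoding the two UTF-8 bytes 0xC3 0xA9 to "ISO-8859-1" yields the single
byte 0xE9, and plain ASCII passes through unchanged.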

If the C locale used UTF-8, the UTF-8 strings in the sources would
display correctly in the absence of any recoding or translating
machinery (which is effectively what happens in the C locale).  This
is pretty much the crux of the point I'm trying to make.

Solely due to the C locale being a throwback to the 1960s, we are not
able to make use of UTF-8 encoded sources or strings unless the C locale
changes.  It's just this one locale.

> I think ASCII 7 would simplify the finding bugs.

In what context?

> An c>127 in a C locale is simply wrong, it will miss interpreted
> by different terminal (local and remote, etc.).

Err, why?  This is a circular argument.  If the C locale used UTF-8,
then c>127 would be perfectly OK.  And code which does things based
on the locale charset should check the locale charmap if it's
important.
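
One way to check the locale charmap is nl_langinfo(CODESET) from POSIX.
The sketch below assumes that matching the spellings "UTF-8" and "utf8"
case-insensitively is good enough; the helper names are hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: does a codeset name denote UTF-8?
 * Accepts "UTF-8" and "UTF8" in any letter case. */
bool codeset_is_utf8(const char *codeset)
{
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Adopt the user's locale from the environment and report whether
 * its character map is UTF-8. */
bool user_locale_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program that cares about the output encoding would call
user_locale_is_utf8() once at startup and branch on the result, rather
than assuming either ASCII or UTF-8.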

> Not always, there are terminal libraries and standard libraries that
> do the right things, but with your proposal, I think in few months
> programs will simply write UTF-8 to terminal, ignoring charset
> choose by user.

Correctly written programs will always use the locale chosen by the
user.  I have not ever said I wanted to ignore the user's charset:
I don't.  They can select (or make) any locale of their choosing,
without it affecting anything to do with the C locale.

> Before was: all must use English because I understand English
> now we want: all must use UTF-8 because I use UTF-8?

Err, *no*.  Whatever gave you that idea?

> If English is the most spoken language (and easier to type), or
> that UTF-8 is technically very good, doesn't mean that we
> should oblige users to use English or UTF-8.

Err, I'm not doing *either*.  I'm talking about the C locale only,
which isn't a locale any *user* should be choosing unless they
want untranslated (English or whatever the programmer used) text.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.

