Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

To: Aurelien Jarno <aurelien@aurel32.net>
Cc: debian-glibc@lists.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
From: Roger Leigh <rleigh@codelibre.net>
Date: Tue, 13 Sep 2011 21:45:40 +0100
Message-id: <20110913204540.GJ3245@codelibre.net>
In-reply-to: <20110913200300.GA31563@hall.aurel32.net>
References: <E1R0G4B-0002L4-Hd@vasks.debian.org> <20110913145323.GA845@riva.dynamic.greenend.org.uk> <4E6F77BF.8010305@aurel32.net> <20110913160746.GC845@riva.dynamic.greenend.org.uk> <20110913200300.GA31563@hall.aurel32.net>

On Tue, Sep 13, 2011 at 10:03:01PM +0200, Aurelien Jarno wrote:
> On Tue, Sep 13, 2011 at 05:07:46PM +0100, Colin Watson wrote:
> > On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> > > Yes similar problems have already been reported. This change has been
> > > done as a C locale should not have a collation order.
> > 
> > Why not?  Codepoint order collation is perfectly reasonable for a C
> > locale.  Lots of people use LC_COLLATE=C when all they want is for
> > things like [a-z] to work reasonably.
> > 
> 
> Because it is supposed to replace the C locale, so to follow POSIX
> rules like the C locale. I am personally not convinced that we should go
> that way, but people who have pushed for this locale (some of them
> Cc:ed) have made clear in bugs #522776 and #609306 that it should handle
> collation like a C locale.
> 
> Maybe they could follow-up this mail with their arguments.

OK, here goes ;-)

The "C.UTF-8" locale /is/ the "C" locale, extended to support UTF-8.
That is, it must support the *standard* behaviour mandated in the
C, POSIX and SUS standards, or else conforming applications will break.

This is the reference for the forthcoming SUSv4 locale definition:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

This standard defines in detail exactly how various aspects of the
C/POSIX locale must behave.  Conforming applications can expect this
behaviour to be guaranteed by a conforming C library.  Some aspects
are strictly defined, while others offer the possibility for
extension.  Examples:

LC_CTYPE

upper
Define characters to be classified as uppercase letters. 
In the POSIX locale, the 26 uppercase letters shall be included:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

lower
Define characters to be classified as lowercase letters. 
In the POSIX locale, the 26 lowercase letters shall be included:
a b c d e f g h i j k l m n o p q r s t u v w x y z

digit
Define the characters to be classified as numeric digits. 
In the POSIX locale, only:
0 1 2 3 4 5 6 7 8 9 shall be included.

space
Define characters to be classified as white-space characters. 
In the POSIX locale, exactly <space>, <form-feed>, <newline>,
<carriage-return>, <tab>, and <vertical-tab> shall be included.

cntrl
Define characters to be classified as control characters. 
In the POSIX locale, no characters in classes alpha or print shall be
included.

xdigit
Define the characters to be classified as hexadecimal digits. 
In the POSIX locale, only:
0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f shall be included.

blank
Define characters to be classified as <blank> characters. 
In the POSIX locale, only the <space> and <tab> shall be included.

toupper
Define the mapping of lowercase letters to uppercase letters. 
In the POSIX locale, at a minimum, the 26 lowercase characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
shall be mapped to the corresponding 26 uppercase characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

tolower
Define the mapping of uppercase letters to lowercase letters. 
In the POSIX locale, at a minimum, the 26 uppercase characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
shall be mapped to the corresponding 26 lowercase characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z

Summary:
• digit, space, cntrl, xdigit and blank are specified exactly.  The C
  locale must only use the specified characters.  These classes can't
  be extended to support other characters, since the standard
  explicitly states this is not allowed.
• upper, lower, toupper and tolower specify minimum requirements.
  It's permitted to extend these to support other characters.
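
To make the distinction above concrete, here is a minimal Python
sketch.  The names EXACT, MINIMUM and definition_allowed are made up
for illustration; they are not part of the standard or of glibc.  It
encodes only the classes whose quoted wording says "only" or "exactly"
(digit, xdigit, space, blank) as fixed sets, and upper/lower as
minimum sets.  cntrl is left out because its quoted rule is an
exclusion rather than a list of characters.

```python
import string

# Classes whose quoted definition says "only" or "exactly":
# a conforming C locale may not add further members.
EXACT = {
    "digit": set(string.digits),
    "xdigit": set(string.hexdigits),
    "space": set(" \f\n\r\t\v"),
    "blank": set(" \t"),
}

# Classes whose quoted definition says the 26 letters "shall be
# included": these are minimum requirements and may be extended.
MINIMUM = {
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
}


def definition_allowed(cls, members):
    """Return True if `members` is an acceptable content for class `cls`
    under the rules quoted above (hypothetical helper)."""
    members = set(members)
    if cls in EXACT:
        # No additions and no omissions are permitted.
        return members == EXACT[cls]
    if cls in MINIMUM:
        # Supersets are fine; the ASCII letters must all be present.
        return members >= MINIMUM[cls]
    raise KeyError(cls)
```

Under these assumptions, adding U+00C9 to "upper" is accepted, while
adding U+00A0 (no-break space) to "blank" is rejected.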

LC_COLLATE
The standard specifies a linear incremental sort order from U+0000 to
U+007F.  That's strictly required by the standard.  There's a lot of
software out there which explicitly switches to the C locale (or just
setlocale(LC_COLLATE, "C")) to get a locale-independent guaranteed
known sort order.  If this were to be changed, a lot of software would
break.
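
The pattern described above looks roughly like this in Python, whose
locale module wraps setlocale() and strcoll().  The helper name
sort_in_c_locale is hypothetical, and the sketch assumes ASCII-only
input, which is the range the standard pins down.

```python
import locale
from functools import cmp_to_key


def sort_in_c_locale(words):
    """Sort `words` with LC_COLLATE temporarily set to "C"
    (hypothetical helper; assumes ASCII-only strings)."""
    # Calling setlocale() with only a category queries the current setting.
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, "C")
        # strcoll() now compares according to the C locale's collation.
        return sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        # setlocale() is process-wide, so restore the previous value.
        locale.setlocale(locale.LC_COLLATE, saved)
```

Since uppercase letters have lower code values than lowercase ones,
"Zebra" sorts before "apple" here regardless of the user's environment.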

My take on this is that a UTF-8 C locale should extend the ordering
so that it just sorts any UCS codepoint by value (i.e. U+0000 to
U+10FFFF).  This extends the existing order cleanly, and I think it
matches expectations of what the C locale provides.  Regarding
handling of non-UTF-8 input, I've not tested how it's handled for
regular locales.  AFAICT collation works on UCS codepoints, so invalid
byte sequences would probably already have been discarded during
conversion?
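
As a rough sketch of what sorting by codepoint value means, the
following Python compares two hypothetical helpers (the names are
mine).  One sorts on the list of codepoint values; the other sorts on
the raw UTF-8 bytes.  UTF-8 is designed so that bytewise comparison
preserves codepoint order, so for valid input the two should agree.
This is only an illustration, not how glibc implements collation.

```python
def codepoint_sort(strings):
    """Sort by the numeric value of each UCS codepoint."""
    return sorted(strings, key=lambda s: [ord(c) for c in s])


def utf8_byte_sort(strings):
    """Sort by the encoded UTF-8 byte sequence."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


# Sample data spanning ASCII, Latin-1, the BMP and a supplementary plane.
words = ["z", "\u00e9", "a", "\u20ac", "\U0001F600", "Z"]
by_codepoint = codepoint_sort(words)
by_bytes = utf8_byte_sort(words)
```

Both lists start with the ASCII strings in their existing C-locale
order, so the extension leaves the guaranteed U+0000 to U+007F
behaviour untouched.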

While in an ideal world it would be great if the "C" locale could
provide the same level of UTF-8/UCS support as other "real" UTF-8
locales, the main issue is ensuring that we comply with the letter
of the standards here--unlike every other locale, this one is
explicitly defined to provide certain things.  The other
consideration is that the "C" locale is by definition a "minimal"
locale that provides a bare minimum of functionality; if you want to
use it to do advanced text processing, I think that's probably outside
its scope.  If we do want a universally available locale that does
provide this level of service, then we should probably name it
something other than "C"; or simply mandate the existence of e.g.
en_US.UTF-8.

I'd really like to see this implemented directly in glibc, since
it's really just a simple modification of the existing hardcoded
C locale.  But it does require processing input/output as UTF-8 and
enabling some of the encoding-related stuff to correctly support
wide streams etc., I think.  I did start hacking on it, but glibc
was mostly undocumented and very complex, so this never got anywhere.
I'll have another attempt sometime, but if you know anyone with more
familiarity with the source (or upstream!), that would probably be
a better plan.

As mentioned on IRC, I joined the Austin Group, and when I have time
I will ask them about UTF-8 support in the C locale, and how this
can be implemented in compliance with the standard.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
