On Sun, Nov 28, 2010 at 05:21:33PM +0000, Thorsten Glaser wrote:
> Fun to be reading this. Me like ;-)
> Anyway. With my Debian hat on, the C/POSIX locales must not use
> UTF-8 as encoding, because otherwise, all kind of hell breaks
> loose (consider running 'tr u x' on a binary or other legacy
> encoded text file, and tr is just an example).

From my reading of the standards, a UTF-8 C locale would be required
to behave identically to the existing ASCII C locale:

• will consider all byte sequences valid
• will use only the ASCII collation sequences (LC_COLLATE would be
  identical)
• LC_CTYPE would probably also be identical; SUS specifies this
  less strictly than LC_COLLATE, but for backward compatibility it
  should probably remain the same.
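
To make the collation point a bit more concrete, here is a rough
sketch of a check that strcoll() in the C locale orders ASCII strings
by plain byte value.  Python and the sample word list are my own
choices for illustration; nothing here comes from the standards text
itself.

```python
import functools
import locale

# Sample ASCII strings, chosen only for illustration.
words = ["zeta", "Alpha", "beta", "Beta", "_x", "10", "9"]

# Switch only the collation category to the C locale.
locale.setlocale(locale.LC_COLLATE, "C")

# Order the strings with the C library's collation function.
by_strcoll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# Order the strings by raw byte value (the ASCII collation sequence).
by_bytes = sorted(words, key=lambda w: w.encode("ascii"))

print(by_strcoll)
print("same order as byte values:", by_strcoll == by_bytes)
```

If a UTF-8 C locale behaved as described above, I would expect the
same comparison to hold there too, but that is an assumption until
such a locale actually exists to test against.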

About the only differences would be that the transliteration table
is no longer needed, and that nl_langinfo(CODESET) would return
UTF-8.  That's pretty much it.
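
For anyone who wants to see what their system reports today, a
minimal sketch along these lines should do.  The helper name
codeset_for is made up for this example, and "C.UTF-8" is only probed
as a candidate name; it may well not be installed, in which case the
helper returns None.

```python
import locale


def codeset_for(locale_name):
    """Return nl_langinfo(CODESET) under locale_name, or None if unavailable."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        # The named locale is not installed on this system.
        return None
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


for name in ("C", "C.UTF-8"):
    print(name, "->", codeset_for(name))
```

The exact string reported for the plain C locale varies between C
libraries, so I have deliberately not listed an expected output.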

I'd like to pursue this in the long term, but I doubt I'll have the
time to commit to it for several months.  If anyone else wishes to
tackle it, feel free to go for it!

> There are plans
> for C.UTF-8 though, and I’m a bit ashamed at having slacked off
> there…

No worries, there's not much going to happen at this stage in the
squeeze freeze.  Hopefully easy to get added early in the wheezy
cycle though!

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776 (the very end)
and #609306 (same bug but a feature request for eglibc).


