Bug#1020654: C.UTF-8: surprising differences in character classes

To: Thorsten Glaser <tg@mirbsd.de>, 1020654@bugs.debian.org
Subject: Bug#1020654: C.UTF-8: surprising differences in character classes
From: Aurelien Jarno <aurelien@aurel32.net>
Date: Sun, 25 Sep 2022 09:09:33 +0200
Message-id: <Yy/+ragUwcXdvuxP@aurel32.net>
Reply-to: Aurelien Jarno <aurelien@aurel32.net>, 1020654@bugs.debian.org
In-reply-to: <166405420787.5686.2169783411397257167.reportbug@tglase.lan.tarent.de>
References: <166405420787.5686.2169783411397257167.reportbug@tglase.lan.tarent.de>

Hi,

On 2022-09-24 23:16, Thorsten Glaser wrote:
> Package: locales
> Version: 2.35-1
> Severity: normal
> X-Debbugs-Cc: tg@mirbsd.de
> 
> While adjusting my localedata patch script to the latest glibc uploads
> I discovered a surprising difference in some categories — for example:

Starting with glibc 2.35, we no longer patch glibc to add C.UTF-8
support. Instead we use the upstream implementation, which comes with
the following NEWS entry [1]:

* Support for the C.UTF-8 locale has been added to glibc.  The locale
  supports full code-point sorting for all valid Unicode code points.  A
  limitation in the framework for fnmatch, regexec, and regcomp requires
  a compromise to save space and only ASCII-based range expressions are
  supported for now (see bug 28255).  The full size of the locale is
  only ~400KiB, with 346KiB coming from LC_CTYPE information for
  Unicode.  This locale harmonizes downstream C.UTF-8 already shipping
  in various downstream distributions.  The locale is not built into
  glibc, and must be installed.

The point of having it merged upstream is that all distributions will
now use the same definition of the C.UTF-8 locale, which was not the
case before.

> (sid-amd64)tglase@tglase:~ $ LC_ALL=C ./tstspc
> U+0009
> U+000A
> U+000B
> U+000C
> U+000D
> U+0020
> (sid-amd64)tglase@tglase:~ $ LC_ALL=C.UTF-8 ./tstspc
> U+0009
> U+000A
> U+000B
> U+000C
> U+000D
> U+0020
> U+1680
> U+2000
> U+2001
> U+2002
> U+2003
> U+2004
> U+2005
> U+2006
> U+2008
> U+2009
> U+200A
> U+2028
> U+2029
> U+205F
> U+3000

This is expected, given that the LC_CTYPE information used for C.UTF-8
comes from Unicode.

> The test program is thus: gcc -O2 -Wall -Wextra -Wformat -o tstspc tstspc.c
> 
> //--------------------------------cut-here------------------------------

[snip]

> //--------------------------------cut-here------------------------------
> 
> 
> In my localedata patch script, I take specific care to change the
> copy of i18n_ctype before applying it to C.UTF-8 as follows:
> 
> space → <U0009>..<U000D>;<U0020>
> cntrl → <U0000>..<U001F>;<U007F>
> blank → <U0009>;<U0020>
> 
> They are as mandated by POSIX for the C locale. I believe I said
> in my original 2013 proposal for a C.UTF-8 locale that it should
> be as close to C as possible while using UTF-8 as encoding.

Those are mandated for the POSIX C locale, but POSIX does not (yet) say
anything about the C.UTF-8 locale. The choice made by upstream was
discussed over many years [2]. If you disagree with it, please take it
up with upstream.

Regards,
Aurelien

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS;h=faa7ec1871da1a34ed943fd8d406496e58fb2c2e;hb=f94f6d8a3572840d3ba42ab9ace3ea522c99c0c2
[2] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net
