
Bug#1017852: libc6: C locale is 7-bit (127 characters), must be 8-bit (256 characters) since POSIX Issue 7 TC2/Issue 8



control: severity -1 normal
control: tag -1 + upstream
control: retitle -1 libc6: mb* functions consider C locale as 7-bit (128 characters) instead of 8-bit (256 characters) since POSIX Issue 7 TC2/Issue 8

On 2022-08-21 16:23, наб wrote:
> Package: libc6
> Version: 2.33-8
> Severity: important
> 
> Dear Maintainer,
> 
> Consider the following reproducer:

[ snip ]
 
> This breaks all programs that expect to process text/data portably,
> since in LC_ALL=C half of all bytes collapse to one character

"Breaks" is a bit strong, given that the C locale has behaved this way
for decades. Note also that the C.UTF-8 locale helps here, although I
agree that it should also work with the POSIX locale.

> (for sort this means that they all collate equally, &c., &c.)!

It depends on what is used for sorting. For instance, the sort(1) utility
behaves correctly with the C locale.

> Consider a diff of XBD 6.2 ("Character Encoding"), Issue 7 vs Issue 7 TC2:
> -- >8 --
> @@ -1768,9 +1664,13 @@
> 
>  <h3><a name="tag_06_02">   6.2 </a>Character Encoding</h3>
> 
> -<p>The POSIX locale contains the characters in <a href="#tagtcjh_3">Portable Character Set</a> , which have the properties listed
> -in <a href="../basedefs/V1_chap07.html#tag_07_03_01"><i>LC_CTYPE</i></a> . In other locales, the presence, meaning, and
> -representation of any additional characters are locale-specific.</p>
> +<p>The POSIX locale shall contain 256 single-byte characters including the characters in <a href="#tagtcjh_3">Portable Character
> +Set</a> and <a href="#tagtcjh_4">Non-Portable Control Characters</a>, which have the properties listed in <a href=
> +"../basedefs/V1_chap07.html#tag_07_03_01"><i>LC_CTYPE</i></a>. It is unspecified whether characters not listed in those two tables
> +are classified as <b>punct</b> or <b>cntrl</b>, or neither. Other locales shall contain the characters in <a href=
> +"#tagtcjh_3">Portable Character Set</a> and may contain any or all of the control characters identified in <a href=
> +"#tagtcjh_4">Non-Portable Control Characters</a>; the presence, meaning, and representation of any additional characters are
> +locale-specific.</p>
> 
>  <p>In locales other than the POSIX locale, a character may have a state-dependent encoding. There are two types of these
>  encodings:</p>
> -- >8 --

That comes from bug 663. However, of the functions listed in that bug,
only the mb* functions are affected. The strcasecmp, strncasecmp,
toupper, tolower and is* functions behave as the standard requires.
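
As a rough illustration of the mb* behaviour in question, here is a minimal
sketch in C. It is not the reproducer from the original report (which is
snipped above), and the helper name count_decodable_single_bytes() is made up
for this example. It asks mbrtowc() to decode each of the 256 byte values on
its own in the C locale and counts how many are accepted.

```c
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of any library: count how many of the
 * 256 possible byte values mbrtowc() accepts as a complete single-byte
 * character in the "C" locale. Returns -1 if the locale cannot be set. */
int count_decodable_single_bytes(void)
{
    int decodable = 0;

    if (setlocale(LC_ALL, "C") == NULL)
        return -1;

    for (int byte = 0; byte <= UCHAR_MAX; byte++) {
        char in = (char)byte;
        wchar_t out;
        mbstate_t state;

        /* Start every conversion from the initial shift state. */
        memset(&state, 0, sizeof state);

        size_t rc = mbrtowc(&out, &in, 1, &state);

        /* 0 means the null byte was decoded, 1 means one byte was consumed.
         * (size_t)-1 and (size_t)-2 report an invalid or incomplete
         * sequence and are not counted. */
        if (rc == 0 || rc == 1)
            decodable++;
    }

    return decodable;
}
```

Calling this from main() and printing the result should, on a libc with the
7-bit behaviour described in this report, give 128, whereas an implementation
following the Issue 7 TC2 wording quoted above would be expected to give 256.
The exact figure depends on the libc in use.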

In any case, please report this issue upstream, as it has to be fixed there.

Regards
Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net
