range of characters in regexp in locale aware environment (Re: Bug#281368: sed: regexes fail on et_EE.UTF-8 locale)

To: Marko Kreen <marko@l-t.ee>, 281368@bugs.debian.org
Cc: debian-i18n@lists.debian.org
Subject: range of characters in regexp in locale aware environment (Re: Bug#281368: sed: regexes fail on et_EE.UTF-8 locale)
From: Fumitoshi UKAI <ukai@debian.or.jp>
Date: Tue, 16 Nov 2004 12:31:19 +0900
Message-id: <87fz3a31x4.wl@ukai.org>
In-reply-to: <20041115125837.2849B2E429@grue.l-t.ee>
References: <20041115125837.2849B2E429@grue.l-t.ee>

Hi,

I am forwarding this to debian-i18n too, because it seems that many users
misunderstand how character ranges match in regexps in a locale-aware
environment.

At Mon, 15 Nov 2004 14:58:37 +0200,
Marko Kreen wrote:

> Tried compiling cross-compiler, configure script
> failed with random error messages.
> 
> Tracked the bug to following testcase:
> 
> ==========================================
> #! /bin/sh
> 
> # regex from configure:
> cf_test () {
>   echo "$1" | sed 's/[-_a-zA-Z0-9]*=//'
> }
> 
> run () {
>   echo --- LANG=$LANG ---
>   cf_test --build=i386-linux
>   cf_test --host=i386-linux
>   cf_test --target=sparc-linux
> }
> 
> # buggy:
> LANG=et_EE.UTF-8; export LANG; run
> # those work:
> LANG=C; export LANG; run
> LANG=en_US.UTF-8; export LANG; run
> =========================================
> 
> Output:
> =========================================
> --- LANG=et_EE.UTF-8 ---
> --bui386-linux
> --hosti386-linux
> --targetsparc-linux
> --- LANG=C ---
> i386-linux
> i386-linux
> sparc-linux
> --- LANG=en_US.UTF-8 ---
> i386-linux
> i386-linux
> sparc-linux
> =========================================
> 
> I must say this makes et_EE.UTF-8 locale unusable, as I cant trust what
> happens in configure or rc.d scripts...
> 
> Btw, it seems similar to gawk multibyte bugs, maybe you could lift a fix
> from there?

I confirmed that I get the same result as you.
However, I think this is not a bug in sed, but a wrong expectation
about how character ranges match in regexps in a locale-aware environment.
Many users think "[a-z]" matches the same as "[abcdefghijklmnopqrstuvwxyz]".
That is true in the C locale and in most locales, such as en_US.UTF-8,
but it is not true in some locales.
The regexp "[a-z]" means "match any character between 'a' and 'z'", and
that range is determined by the collation order defined by LC_COLLATE
(as used by wcscoll(3)).

In the et_EE locale, the collation order would be:
 a b c d e f g h i j k l m n o p q r s z t u v w x y
according to LC_COLLATE in /usr/share/i18n/locales/et_EE.

So "[a-z]" is equivalent to "[abcdefghijklmnopqrsz]", and it won't match any of
't' 'u' 'v' 'w' 'x' 'y', because these sort after 'z' and so fall outside the
range a-z.
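
The range computation can be sketched in plain shell, without needing the
et_EE locale installed. The order string is copied from the collation
sequence listed above; the variable names are only illustrative.

```shell
#!/bin/sh
# Collation sequence of the basic letters, as listed above for et_EE.
order="abcdefghijklmnopqrsztuvwxy"

# "[a-z]" covers everything from 'a' up to and including 'z' in that sequence.
in_range="${order%%z*}z"
# Whatever sorts after 'z' falls outside the range.
out_of_range="${order#*z}"

echo "inside  [a-z]: $in_range"       # abcdefghijklmnopqrsz
echo "outside [a-z]: $out_of_range"   # tuvwxy
```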

This is why "[-_a-zA-Z0-9]*=" matches only "ild=" in "--build=i386-linux"
(note that 'u' is outside the set "[-_a-zA-Z0-9]"), and the result becomes
"--bui386-linux": 's/[-_a-zA-Z0-9]*=//' replaces "ild=" with "", leaving
"--bu" + "i386-linux".
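
The same output can be reproduced on a system without the et_EE locale by
spelling out the letters that "[a-z]" covers there and running sed in the
C locale. This is only a simulation under that assumption (it changes nothing
but the lowercase range), and cf_test_et is a hypothetical name.

```shell
#!/bin/sh
# Simulate the et_EE behaviour: "a-z" is replaced by the letters it covers
# in that collation order, so t u v w x y are missing from the set.
# LC_ALL=C keeps the remaining ranges (A-Z, 0-9) in plain byte order.
cf_test_et () {
  echo "$1" | LC_ALL=C sed 's/[-_abcdefghijklmnopqrszA-Z0-9]*=//'
}

cf_test_et --build=i386-linux     # prints --bui386-linux
cf_test_et --host=i386-linux      # prints --hosti386-linux
cf_test_et --target=sparc-linux   # prints --targetsparc-linux
```

In the last two cases the letter just before '=' is 't', which is not in the
set, so only the bare "=" is matched and removed.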

In this case, the correct regexp would be
 s/[-_[:alpha:][:digit:]]*=//
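
Applied to the test case from the report, the character-class version can be
sketched as follows; cf_test_fixed is a hypothetical name. Classes such as
[:alpha:] are defined by class membership, not by a collation range, so the
result should not depend on where 'z' sorts.

```shell
#!/bin/sh
# Same test as in the report, with ranges replaced by character classes.
cf_test_fixed () {
  echo "$1" | sed 's/[-_[:alpha:][:digit:]]*=//'
}

cf_test_fixed --build=i386-linux     # prints i386-linux
cf_test_fixed --host=i386-linux      # prints i386-linux
cf_test_fixed --target=sparc-linux   # prints sparc-linux
```

For scripts that only ever handle ASCII option names, another workaround
(not part of the suggestion above) is to run sed with LC_ALL=C, where the
ranges follow plain byte order.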

If you think this collation order is a bug, you should report it
against the locales package.

Regards,
Fumitoshi UKAI
