range of characters in regexp in locale aware environment (Re: Bug#281368: sed: regexes fail on et_EE.UTF-8 locale)
Hi,
I am forwarding this to debian-i18n as well, because it seems that many users
misunderstand how character ranges in regexps match in a locale-aware
environment.
At Mon, 15 Nov 2004 14:58:37 +0200,
Marko Kreen wrote:
> Tried compiling cross-compiler, configure script
> failed with random error messages.
>
> Tracked the bug to following testcase:
>
> ==========================================
> #! /bin/sh
>
> # regex from configure:
> cf_test () {
> echo "$1" | sed 's/[-_a-zA-Z0-9]*=//'
> }
>
> run () {
> echo --- LANG=$LANG ---
> cf_test --build=i386-linux
> cf_test --host=i386-linux
> cf_test --target=sparc-linux
> }
>
> # buggy:
> LANG=et_EE.UTF-8; export LANG; run
> # those work:
> LANG=C; export LANG; run
> LANG=en_US.UTF-8; export LANG; run
> =========================================
>
> Output:
> =========================================
> --- LANG=et_EE.UTF-8 ---
> --bui386-linux
> --hosti386-linux
> --targetsparc-linux
> --- LANG=C ---
> i386-linux
> i386-linux
> sparc-linux
> --- LANG=en_US.UTF-8 ---
> i386-linux
> i386-linux
> sparc-linux
> =========================================
>
> I must say this makes et_EE.UTF-8 locale unusable, as I cant trust what
> happens in configure or rc.d scripts...
>
> Btw, it seems similar to gawk multibyte bugs, maybe you could lift a fix
> from there?
I confirmed that I get the same result as you.
However, I think this is not a bug in sed, but a wrong expectation
about how character ranges in regexps match in a locale-aware environment.
Many users think "[a-z]" matches the same characters as "[abcdefghijklmnopqrstuvwxyz]".
That is true in the C locale and in most locales, such as en_US.UTF-8,
but it is not true in some locales.
The regexp "[a-z]" means "match any character between 'a' and 'z'", and
that range is determined by the collation order defined by LC_COLLATE (cf. wcscoll(3)).
In the et_EE locale, the collation order is:
a b c d e f g h i j k l m n o p q r s z t u v w x y
according to LC_COLLATE in /usr/share/i18n/locales/et_EE.
So "[a-z]" is equivalent to "[abcdefghijklmnopqrsz]", and it does not match any of
't' 'u' 'v' 'w' 'x' 'y', because these sort after 'z' and so fall outside the range a-z.
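
One way to look at a locale's collation order from the shell is to sort a few letters under it. This is only a sketch: collate_demo is a hypothetical helper name, and the et_EE.UTF-8 line is printed only if that locale appears to be installed, so its output depends on the locale data on your system.

```sh
#!/bin/sh
# Hypothetical helper: sort the letters t, s, z using the collation
# order of the locale given as $1, and print them on one line.
collate_demo () {
    printf '%s\n' t s z | LC_ALL="$1" sort | paste -sd ' ' -
}

# The C locale sorts by byte value, so 'z' comes last.
echo "C: $(collate_demo C)"

# Only try et_EE.UTF-8 if `locale -a` lists it (assumption: it is
# listed as et_EE.utf8 or et_EE.UTF-8).
if locale -a 2>/dev/null | grep -qiE '^et_EE\.utf-?8$'; then
    echo "et_EE.UTF-8: $(collate_demo et_EE.UTF-8)"
fi
```

If 'z' is printed before 't' for et_EE.UTF-8, then 't' lies outside the range a-z in that locale.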
This is why "[-_a-zA-Z0-9]*=" matches only "ild=" in "--build=i386-linux"
(note that 'u' is outside the bracket expression "[-_a-zA-Z0-9]"), and the result becomes
"--bui386-linux": 's/[-_a-zA-Z0-9]*=//' replaces "ild=" with "", leaving
"--bu" + "i386-linux".
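
The effect can be reproduced without the et_EE locale installed by writing the lowercase range out by hand. The sketch below is an illustration, not the original test case: cf_test_et is a hypothetical name, "a-z" is replaced by the explicit list "abcdefghijklmnopqrsz", and LC_ALL=C is forced so that the remaining ranges A-Z and 0-9 behave the same everywhere.

```sh
#!/bin/sh
# Hypothetical helper: the configure regexp with "a-z" spelled out as
# the letters that fall inside that range under et_EE collation.
# LC_ALL=C keeps A-Z and 0-9 locale-independent for this illustration.
cf_test_et () {
    echo "$1" | LC_ALL=C sed 's/[-_abcdefghijklmnopqrszA-Z0-9]*=//'
}

cf_test_et --build=i386-linux     # prints --bui386-linux
cf_test_et --host=i386-linux      # prints --hosti386-linux
cf_test_et --target=sparc-linux   # prints --targetsparc-linux
```

In the second and third cases 't' directly precedes '=', so the bracket expression matches zero characters and only the '=' is removed.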
In this case, the correct regexp would be
s/[-_[:alpha:][:digit:]]*=//
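
A minimal sketch of the test case using that regexp follows. The function names are hypothetical. The second variant is a different technique that I add only as an assumption-laden alternative: it forces the C locale for the single sed call, which is reasonable only when the option names are known to be plain ASCII.

```sh
#!/bin/sh
# Variant 1: character classes instead of ranges, so the match does
# not depend on the collation order of the current locale.
cf_test_class () {
    echo "$1" | sed 's/[-_[:alpha:][:digit:]]*=//'
}

# Variant 2 (assumption: ASCII-only option names): keep the original
# ranges but run sed in the C locale, where a-z means the 26 ASCII
# lowercase letters.
cf_test_c () {
    echo "$1" | LC_ALL=C sed 's/[-_a-zA-Z0-9]*=//'
}

cf_test_class --build=i386-linux    # prints i386-linux
cf_test_class --target=sparc-linux  # prints sparc-linux
cf_test_c --host=i386-linux         # prints i386-linux
```

Note that [:alpha:] is itself locale dependent: in a UTF-8 locale it also matches non-ASCII letters, which may or may not be wanted in a configure script.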
If you think this collation order is a bug, you should file a bug
against the locales package.
Regards,
Fumitoshi UKAI