Re: range of characters in regexp in locale aware environment (Re: Bug#281368: sed: regexes fail on et_EE.UTF-8 locale)

To: Fumitoshi UKAI <ukai@debian.or.jp>
Cc: 281368@bugs.debian.org, debian-i18n@lists.debian.org
Subject: Re: range of characters in regexp in locale aware environment (Re: Bug#281368: sed: regexes fail on et_EE.UTF-8 locale)
From: Marko Kreen <marko@l-t.ee>
Date: Tue, 16 Nov 2004 12:07:10 +0200
Message-id: <20041116100710.GB7031@l-t.ee>
In-reply-to: <87fz3a31x4.wl@ukai.org>
References: <20041115125837.2849B2E429@grue.l-t.ee> <87fz3a31x4.wl@ukai.org>

On Tue, Nov 16, 2004 at 12:31:19PM +0900, Fumitoshi UKAI wrote:
> At Mon, 15 Nov 2004 14:58:37 +0200,
> Marko Kreen wrote:
> > # regex from configure:
> >   echo "$1" | sed 's/[-_a-zA-Z0-9]*=//'

> In et_EE locale, collate order would be:
>  a b c d e f g h i j k l m n o p q r s z t u v w x y
> according to LC_COLLATE in /usr/share/i18n/locales/et_EE.
> 
> So "[a-z]" is equivalent to "[abcdefghijklmnopqrsz]" and will not match any of
> 't' 'u' 'v' 'w' 'x' 'y', because these sort after 'z' and fall outside a-z.
> 
> This is why "[-_a-zA-Z0-9]*=" matches only "ild=" in "--build=i386-linux"
> (note that 'u' is outside the range "[-_a-zA-Z0-9]"), and the result becomes
> "--bui386-linux": 's/[-_a-zA-Z0-9]*=//' deletes "ild=", leaving
> "--bu" + "i386-linux".
> 
> In this case, the correct regexp would be
>  s/[-_[:alpha:][:digit:]]*=//
> 
> If you think this collate order is a bug, you should report it
> against the locales package.
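
The effect described above can be reproduced without having the et_EE
locale installed. What follows is only a sketch: it writes the range out
by hand as the set quoted above (so t u v w x y are missing) and runs sed
under LC_ALL=C so the result does not depend on the local environment.
The function names are made up for illustration.

```shell
#!/bin/sh
# Emulates the et_EE expansion of [a-z] quoted above by spelling the set
# out by hand; t u v w x y are deliberately absent. A-Z is left as a plain
# range because the example input has no uppercase letters.
# Function names are hypothetical, for illustration only.
strip_et_ee_like() {
    printf '%s\n' "$1" |
        LC_ALL=C sed 's/[-_abcdefghijklmnopqrszA-Z0-9]*=//'
}

# The suggested fix: POSIX character classes instead of ranges, so the
# collation order no longer decides which letters are included.
strip_posix_class() {
    printf '%s\n' "$1" |
        sed 's/[-_[:alpha:][:digit:]]*=//'
}

strip_et_ee_like  '--build=i386-linux'   # prints --bui386-linux
strip_posix_class '--build=i386-linux'   # prints i386-linux
```

The first call stops at 'u', so the leftmost match is "ild="; the second
consumes the whole "--build=" prefix.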

AFAIR the collate order is right.  Now that I think about it,
that means et_EE.iso-8859-1 will fail too...

You said "many users misunderstand range of characters matches",
but the regex comes from autoconf code.  Should the use of a-z be
reported as a bug in the package, or should users in weird locales
expect such failures?

At least I set root's locale to en_US ...
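
A narrower workaround than changing the account's locale, sketched here
as an assumption and not as something autoconf is known to do, would be
to force the C locale for the sed call alone. In the C locale a-z, A-Z
and 0-9 are plain ASCII ranges. The helper name optarg_of is
hypothetical.

```shell
#!/bin/sh
# Force the C locale for this one sed call so that a-z, A-Z and 0-9 are
# plain ASCII ranges, whatever LANG / LC_* the caller has set.
# optarg_of is a hypothetical helper name, not part of autoconf.
optarg_of() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[-_a-zA-Z0-9]*=//'
}

optarg_of '--build=i386-linux'   # prints i386-linux
```

LC_ALL overrides both LANG and the individual LC_* variables, which is
why it is used here in place of LC_COLLATE alone.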

-- 
marko
