
Re: range of characters in regexp in locale aware environment (Re: Bug#281368: sed: regexes fail on et_EE.UTF-8 locale)



At Tue, 16 Nov 2004 12:07:10 +0200,
Marko Kreen wrote:
> On Tue, Nov 16, 2004 at 12:31:19PM +0900, Fumitoshi UKAI wrote:
> > At Mon, 15 Nov 2004 14:58:37 +0200,
> > Marko Kreen wrote:
> > > # regex from configure:
> > >   echo "$1" | sed 's/[-_a-zA-Z0-9]*=//'
> 
> > In et_EE locale, collate order would be:
> >  a b c d e f g h i j k l m n o p q r s z t u v w x y
> > according to LC_COLLATE in /usr/share/i18n/locales/et_EE.
> > 
> > So, "[a-z]" is equivalent to "[abcdefghijklmnopqrsz]", and it won't match any of
> > 't' 'u' 'v' 'w' 'x' 'y', because these fall outside the range a-z.
> > 
> > This is why "[-_a-zA-Z0-9]*=" matches "ild=" in "--build=i386-linux"
> > (note that 'u' is outside the range of "[-_a-zA-Z0-9]"), and the result becomes
> > "--bui386-linux" ("ild=" is replaced by "" by 's/[-_a-zA-Z0-9]*=//', so
> > "--bu" + "i386-linux" is the result).
> > 
> > In this case, the correct regexp would be
> >  s/[-_[:alpha:][:digit:]]*=//
> > 
> > If you think this collate order is a bug, you should report it
> > against the locales package.
> 
> AFAIR the collate order is right.  And now I think about it,
> that means et_EE.iso-8859-1 will fail too...
> 
> You said "many users misunderstand range of characters matches"
> but the regex is from autoconf code - should use of a-z be
> reported as a bug in the package, or should users in weird locales
> expect such failures?

AFAIK, autoconf code is expected to be run in the C locale or with no locale set.
Near the top of configure, there is this code:

(autoconf 2.59 /usr/share/autoconf/autoconf/autoconf.m4f)
 # NLS nuisances.
 for as_var in \
  LANG LANGUAGE LC_ADDRESS LC_ALL LC_COLLATE LC_CTYPE LC_IDENTIFICATION \
  LC_MEASUREMENT LC_MESSAGES LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER \
  LC_TELEPHONE LC_TIME
 do
  if (set +x; test -z "`(eval $as_var=C; export $as_var) 2>&1`"); then
    eval $as_var=C; export $as_var
  else
    $as_unset $as_var
  fi
 done

(autoconf2.13 /usr/share/autoconf2.13/*)
 # NLS nuisances.
 # Only set these to C if already set.  These must not be set unconditionally
 # because not all systems understand e.g. LANG=C (notably SCO).
 # Fixing LC_MESSAGES prevents Solaris sh from translating var values in `set'!
 # Non-C LC_CTYPE values break the ctype check.
 if test "${LANG+set}"   = set; then LANG=C;   export LANG;   fi
 if test "${LC_ALL+set}" = set; then LC_ALL=C; export LC_ALL; fi
 if test "${LC_MESSAGES+set}" = set; then LC_MESSAGES=C; export LC_MESSAGES; fi
 if test "${LC_CTYPE+set}"    = set; then LC_CTYPE=C;    export LC_CTYPE;    fi
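
The "only if already set" test in the 2.13 code relies on the ${VAR+set}
expansion, which yields "set" whenever the variable exists, even if it is
empty. As a rough sketch of the same idiom for an arbitrary variable
(reset_if_set is a hypothetical helper name of mine, not something autoconf
provides):

```shell
# Hypothetical helper, not part of autoconf: reset the named variable to C,
# but only when it is already set (an empty value still counts as set).
reset_if_set() {
  # For $1=LC_CTYPE the eval runs:  test "${LC_CTYPE+set}" = set
  if eval "test \"\${$1+set}\" = set"; then
    eval "$1=C; export $1"
  fi
}

# Apply it to the same four variables the autoconf 2.13 code handles.
for var in LANG LC_ALL LC_MESSAGES LC_CTYPE; do
  reset_if_set "$var"
done
```

Note that this list, like the 2.13 original, does not touch LC_COLLATE, so
the a-z range is only protected there when LANG or LC_ALL was already set
(and has therefore been reset to C).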

If a configure script generated by autoconf contains no such code, that
would be a bug in autoconf.
But if you borrow a regex from autoconf code and run it elsewhere, you should
be aware of this issue: a range such as a-z depends on the collation order of
the current locale.
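
For a regex reused outside configure, there are two ways to avoid depending
on the caller's locale: force the C locale for that one sed call, or use
POSIX character classes as suggested earlier in the thread. A minimal sketch,
assuming a POSIX sh and sed; the function names are made up for illustration
and do not come from autoconf:

```shell
#!/bin/sh
# Hypothetical helpers: strip a leading "name=" prefix from a
# configure-style argument such as --build=i386-linux.

# Variant 1: run sed in the C locale, where a-z, A-Z and 0-9 are plain
# ASCII ranges regardless of the user's LC_COLLATE.
strip_optname_c() {
  printf '%s\n' "$1" | LC_ALL=C sed 's/[-_a-zA-Z0-9]*=//'
}

# Variant 2: use character classes instead of ranges, so the collation
# order no longer decides which letters are included.
strip_optname_posix() {
  printf '%s\n' "$1" | sed 's/[-_[:alpha:][:digit:]]*=//'
}

strip_optname_c "--build=i386-linux"      # prints i386-linux
strip_optname_posix "--build=i386-linux"  # prints i386-linux
```

The two variants are not strictly equivalent: [:alpha:] follows LC_CTYPE, so
in a non-C locale it may also accept non-ASCII letters, while variant 1
accepts ASCII letters only.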

Regards,
Fumitoshi UKAI


