
Re: egrep oddity



On 2012-02-05 17:55:48 -0700, Bob Proulx wrote:
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" with 'A'!

This is not what I observe on Debian/unstable, although it is the
behavior I was expecting. Is it a bug?

xvii% export LC_ALL=en_US.utf8
xvii% locale
LANG=POSIX
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
xvii% echo BC | grep '[a-z]'
xvii% echo BC | grep '[A-z]'
grep: Invalid range end
xvii% echo BC | LC_ALL=C grep '[A-z]'
BC

The "Invalid range end" error with '[A-z]' shows that the locale's
collating rules do affect range expressions, but in that case I would
have expected

  echo BC | grep '[a-z]'

to output BC. At least "sort" seems to behave as expected:

xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort
AB
BC
CD
ab
bc
cd
xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=en_US.utf8 sort
ab
AB
bc
BC
cd
CD

> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both gnu awk and gnu grep IIRC,
> awk is in experimental now) to reverse this behavior in the userland
> code.

Perhaps this explains what I'm seeing with grep, except for [A-z].
But the grep man page still says:

  Within a bracket expression, a range expression consists of two
  characters separated by a hyphen.  It matches any single character that
  sorts between the two characters, inclusive, using the locale's
  collating sequence and character set.  For example, in the default C
  locale, [a-d] is equivalent to [abcd].  Many locales sort characters in
  dictionary order, and in these locales [a-d] is typically not
  equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
  To obtain the traditional interpretation of bracket expressions, you
  can use the C locale by setting the LC_ALL environment variable to the
  value C.
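
For what it's worth, the C-locale workaround described in that last
sentence can be sketched as below. This is only an illustration and not
something taken from the thread: the helper name "count" is made up, and
the [[:lower:]] and [[:upper:]] classes are added purely for comparison,
since they do not depend on the collating sequence. Only LC_ALL=C is
used, because what happens under en_US.utf8 seems to depend on the grep
version.

```shell
#!/bin/sh
# Sketch: count matching lines for a few bracket expressions in the C locale.
# "count" is a hypothetical helper introduced for this example only.
count() {
    # $1 = pattern; prints the number of matching input lines (0 or 1).
    # grep exits with status 1 when nothing matches, hence the "|| true".
    printf '%s\n' BC | LC_ALL=C grep -c -e "$1" || true
}

range_lower=$(count '[a-z]')        # byte range a..z, lowercase only
class_lower=$(count '[[:lower:]]')  # POSIX class, independent of collation
range_mixed=$(count '[A-z]')        # byte range A..z, includes uppercase
class_upper=$(count '[[:upper:]]')  # POSIX class, independent of collation

echo "range_lower=$range_lower class_lower=$class_lower range_mixed=$range_mixed class_upper=$class_upper"
```

In the C locale this should print range_lower=0 class_lower=0
range_mixed=1 class_upper=1, which is the "traditional interpretation"
the man page refers to.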

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

