
Re: egrep oddity



Vincent Lefevre wrote:
> Bob Proulx wrote:
> > The collation sequence of [a-z] in dictionary ordering is really
> > "aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
> > getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> > really getting "AbBcC...xXyYzZ" with 'A'!
> 
> This is not what I observe (though I was expecting this behavior)
> on Debian/unstable. Is it a bug?

That tells me you are running Sid or Testing with the newer grep.
Try it on a Squeeze machine to observe the previous behavior.

Squeeze released with 2.6.3 but Sid currently has 2.10.  Etch released
with 2.5.3.
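
Since the behavior depends on which grep you have, it is worth checking
first.  A minimal sketch, assuming a grep that accepts --version (GNU
grep does); the variable name grep_version is just for illustration:

```shell
# Capture the first line of the version banner.  Greps that do not
# accept --version leave the variable empty, reported as "unknown".
grep_version=$(grep --version 2>/dev/null | head -n 1)
printf 'grep reports: %s\n' "${grep_version:-unknown}"
```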

Here is the upstream discussion of rational ranges:

  http://lists.gnu.org/archive/html/bug-grep/2011-11/msg00106.html
Continues:
  http://lists.gnu.org/archive/html/bug-grep/2011-12/msg00003.html
Implementation:
  http://lists.gnu.org/archive/html/bug-grep/2012-01/msg00088.html

I haven't been following the upstream grep project closely, so I am
making some assumptions that may be incorrect.  But the new behavior
matches what I am seeing, so the assumption seems reasonable.  What
confuses things is that Debian-specific range patches in grep have
been noted in the changelog as coming and going.  Seeing those makes
me less sure of my assumptions, but I don't have the time to pull the
code and look just to satisfy my curiosity.  If you find out otherwise
I would be interested in knowing.

> xvii% export LC_ALL=en_US.utf8
> xvii% echo BC | grep '[a-z]'

In Sid with 2.10, yes, that prints nothing.  With grep 2.6.3 in Squeeze:

  $ echo BC | LC_ALL=en_US.UTF-8 grep '[a-z]'
  BC
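
To compare machines quickly, something like the following probe may
help.  It is only a sketch: probe_range is a made-up helper name, and
it assumes the named locale is installed.  If it is not, grep quietly
falls back to C behavior and the probe tells you nothing about that
locale.

```shell
# probe_range LOCALE: report whether the uppercase string "BC" matches
# the range [a-z] when grep runs under LOCALE.
probe_range() {
    if printf '%s\n' BC | LC_ALL="$1" grep -q '[a-z]'; then
        echo "match"
    else
        echo "no match"
    fi
}

printf 'C:           %s\n' "$(probe_range C)"
# The result for this locale depends on the grep version, as discussed.
printf 'en_US.UTF-8: %s\n' "$(probe_range en_US.UTF-8)"
```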

> At least "sort" seems to behave as expected:
> 
> xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort
> AB
> BC
> CD
> ab
> bc
> cd
> xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=en_US.utf8 sort
> ab
> AB
> bc
> BC
> cd
> CD

For more on this topic, try it with some punctuation in place.  In
this locale punctuation is ignored when sorting, which can produce
some very surprising sort results.

  $ printf '%s\n' AB A.B BC B.C CD ab bc b.c cd | LC_ALL=en_US.UTF-8 sort
  ab
  AB
  A.B
  bc
  b.c
  BC
  B.C
  cd
  CD

  $ printf '%s\n' AB A.B BC B.C CD ab bc b.c cd | LC_ALL=C sort
  A.B
  AB
  B.C
  BC
  CD
  ab
  b.c
  bc
  cd

> But the grep man page still says:
> 
>   Within a  bracket  expression,  a  range  expression  consists  of  two
>   characters separated by a hyphen.  It matches any single character that
>   sorts  between  the  two  characters,  inclusive,  using  the  locale's
>   collating  sequence  and  character set.  For example, in the default C
>   locale, [a-d] is equivalent to [abcd].  Many locales sort characters in
>   dictionary   order,  and  in  these  locales  [a-d]  is  typically  not
>   equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
>   To  obtain  the  traditional interpretation of bracket expressions, you
>   can use the C locale by setting the LC_ALL environment variable to  the
>   value C.

I don't see any problem with that wording.  The room for almost any
behavior comes from "using the locale's collating sequence and
character set", which is defined by libc rather than by grep.
Was there something there in particular that you didn't like?

Fortunately setting LC_ALL=C makes the behavior converge across all
of the versions.  It would be a nightmare to keep track of all of the
individual versions and behaviors otherwise.
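
As a small illustration of pinning the locale per command, reusing the
same sample words from above (the variable names c_grep and c_sort are
mine, and paste is only there to put the result on one line):

```shell
# Under LC_ALL=C the range [a-z] is just the ASCII lowercase letters,
# and sort orders by byte value, so uppercase sorts before lowercase.
c_grep=$(printf '%s\n' AB BC CD ab bc cd | LC_ALL=C grep '[a-z]' | paste -sd' ' -)
c_sort=$(printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort | paste -sd' ' -)

printf 'grep: %s\n' "$c_grep"   # grep: ab bc cd
printf 'sort: %s\n' "$c_sort"   # sort: AB BC CD ab bc cd
```

Setting the variable on the command itself, rather than exporting it,
keeps the rest of the session in your normal locale.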

Bob
