Re: egrep oddity

To: debian-user@lists.debian.org
Subject: Re: egrep oddity
From: Vincent Lefevre <vincent@vinc17.net>
Date: Mon, 6 Feb 2012 12:03:58 +0100
Message-id: <20120206110358.GB20394@xvii.vinc17.org>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20120206005548.GB31009@hysteria.proulx.com>
References: <201202051603.45935.neal.p.murphy@alum.wpi.edu> <20120206005548.GB31009@hysteria.proulx.com>

On 2012-02-05 17:55:48 -0700, Bob Proulx wrote:
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" with 'A'!

This is not what I observe (though I was expecting this behavior)
on Debian/unstable. Is it a bug?

xvii% export LC_ALL=en_US.utf8
xvii% locale
LANG=POSIX
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
xvii% echo BC | grep '[a-z]'
xvii% echo BC | grep '[A-z]'
grep: Invalid range end
xvii% echo BC | LC_ALL=C grep '[A-z]'
BC

The "Invalid range end" error with '[A-z]' shows that the locale's
collating rules are taken into account (the same range is accepted in
the C locale), but in that case I would have expected

  echo BC | grep '[a-z]'

to output BC. At least "sort" seems to behave as expected:

xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort
AB
BC
CD
ab
bc
cd
xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=en_US.utf8 sort
ab
AB
bc
BC
cd
CD
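
For comparison, here is a minimal sketch of why '[A-z]' is a valid range
in the C locale, where ranges follow byte values. It assumes an ASCII
system and a POSIX shell; the variable names are only illustrative.

```shell
# In the C locale, a range covers the byte values between its endpoints.
first=$(printf '%d' "'A")   # numeric code of 'A'
last=$(printf '%d' "'z")    # numeric code of 'z'
echo "A=$first z=$last"

# Uppercase letters lie inside that span, and so does '_' (code 95),
# which sits between 'Z' and 'a' in ASCII.
upper_match=$(echo BC | LC_ALL=C grep '[A-z]')
punct_match=$(echo _ | LC_ALL=C grep '[A-z]')
echo "$upper_match $punct_match"
```

So in the C locale '[A-z]' matches more than letters, which is one more
reason not to rely on it.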

> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both gnu awk and gnu grep IIRC,
> awk is in experimental now) to reverse this behavior in the userland
> code.

Perhaps this explains what I'm seeing with grep, except for [A-z].
But the grep man page still says:

  Within a  bracket  expression,  a  range  expression  consists  of  two
  characters separated by a hyphen.  It matches any single character that
  sorts  between  the  two  characters,  inclusive,  using  the  locale's
  collating  sequence  and  character set.  For example, in the default C
  locale, [a-d] is equivalent to [abcd].  Many locales sort characters in
  dictionary   order,  and  in  these  locales  [a-d]  is  typically  not
  equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
  To  obtain  the  traditional interpretation of bracket expressions, you
  can use the C locale by setting the LC_ALL environment variable to  the
  value C.
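
The workaround described in that paragraph can be sketched as follows,
together with a POSIX character class as an alternative to a range
(character classes depend on LC_CTYPE rather than on the collating
sequence). The helper name matches_in_c is hypothetical, and this is
only an illustration, not a statement about what grep does in
en_US.utf8.

```shell
# Hypothetical helper: report whether grep finds the pattern in the text
# when run in the C locale.
matches_in_c() {
    if printf '%s\n' "$1" | LC_ALL=C grep -q -e "$2"; then
        echo yes
    else
        echo no
    fi
}

range_upper=$(matches_in_c BC '[a-z]')        # uppercase input, lowercase range
range_lower=$(matches_in_c bc '[a-z]')        # lowercase input, lowercase range
class_upper=$(matches_in_c BC '[[:upper:]]')  # explicit class instead of a range
echo "$range_upper $range_lower $class_upper"
```

In the C locale, '[a-z]' matches only lowercase letters, so BC is
rejected and bc is accepted.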

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
