
Re: egrep oddity



Neal Murphy wrote:
> For quite some time now, I've been getting peeved with egrep not
> doing what it should.

You don't like it and I don't like it but the powers that be have
decided that within a locale, within libc, character collation
sequences will be dictionary ordering where case is folded and
punctuation is ignored.  They failed to see how this would negatively
impact almost everything.  Creeping features.

And because punctuation is ignored it causes a lot of problems with
utilities such as sort.  You didn't have to say LC_ALL=C for the first
thirty years.  But you do now.  (Or at least since the mid-1990s.)
In almost all scripts dealing with sort ordering you will find it
necessary to set LC_ALL=C to get expected results.  I have been a
rather outspoken critic of this design decision on other mailing
lists.
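
As a rough illustration (the word list is made up for the example),
forcing the C locale for just the sort call in a script looks
something like this:

```shell
#!/bin/sh
# Hypothetical sample input; any mixed-case list will do.
words=$(printf '%s\n' banana Apple cherry Banana apple)

# LC_ALL=C is set only for this one sort invocation, so the rest of
# the script keeps whatever locale the user has configured.
sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)

# Byte-value order: all uppercase names come before lowercase ones.
printf '%s\n' "$sorted"
```

Setting the variable on the command line of sort itself, rather than
exporting it at the top of the script, keeps messages and character
handling elsewhere in the script in the user's own locale.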

> I have Squeeze installed and up-to-date. In an xterm running bash or on a 
> console running bash or dash, this command:
>   ls -C1 | egrep "^[A-Z]"
> returns all lines except those beginning with 'a'.

The collation sequence of the letters in dictionary ordering is really
"aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
really getting "AbBcC...xXyYzZ" without 'a'!

In the en_US.UTF-8 locale (for example) what would traditionally have
been [A-Z] and [a-z] now must be specified as [[:upper:]] and
[[:lower:]] instead.
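
A small sketch of both workarounds, using made-up sample lines in
place of the 'ls' output from the original report:

```shell
#!/bin/sh
# Hypothetical sample lines standing in for the 'ls' output.
lines=$(printf '%s\n' alpha Bravo charlie Delta)

# Character class: matches uppercase letters whatever the collation
# order of the current locale happens to be.
by_class=$(printf '%s\n' "$lines" | grep -E '^[[:upper:]]')

# Traditional range, made predictable by forcing the C locale for
# this one grep invocation only.
by_range=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '^[A-Z]')

printf '%s\n' "$by_class"
printf '%s\n' "$by_range"
```

Both commands should select only the lines that start with a capital
letter.  The character class is the more portable of the two since it
does not depend on overriding the locale.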

[An Aside: I do find [:space:] a useful character class meaning any of
the whitespace characters.  It is POSIX compliant and can be cut and
pasted nicely.  Of course Perl heads will want to use \s and \S.]

In better news, after years and years of dealing with this problem,
there is now a move by applications (both GNU awk and GNU grep IIRC,
awk is in experimental now) to reverse this behavior in the userland
code.  So what libc has put in will finally be reversed by
applications voting with code to take it out.  The newest GNU awk is
re-implementing ranges A-Z and a-z as you would expect.

Here is a reference:

    http://www.gnu.org/s/gawk/manual/html_node/Ranges-and-Locales.html

> Even the following commands exhibit similar behavior:
>
>   alias|sed -e 's/^a/b/'|egrep "^[A-Z]"  # passes sed's output untouched
>   alias|sed -e 's/^a/A/'|egrep "^[A-Z]"  # passes sed's output untouched
> 
> These commands behave the same way on another Squeeze installation
> at another location. Also, 'grep -E' behaves the same way.
> 
> The commands behave as expected on a different GNU/Linux system.
> 
> Does anyone else see this behavior? Or do I need to clean my pipe and smoke 
> something better?

The character collation sequence affects almost everything on the
system that sorts.  This includes commands such as 'ls' and also your
shell (e.g. 'echo *').  Plus things like 'expr'.  Everything that
uses libc strcoll(3), and that is pretty much everything.  It is a part
of the conversion for a program to be multi-byte character aware.
Programs were converted en masse in the 1990s in order to support
UTF-8 and non-English locales.  Mostly that was good.  But for things
like this I think it was quite bad.
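
To see the effect on 'ls', here is a minimal sketch using a scratch
directory and invented file names.  Only the C-locale result is
captured, since the ordering in any other locale depends on what is
installed on the system:

```shell
#!/bin/sh
# Scratch directory with hypothetical file names.
dir=$(mktemp -d)
touch "$dir/apple" "$dir/Banana" "$dir/cherry"

# In the C locale the names are ordered by byte value, so the
# capitalized name sorts ahead of the lowercase ones.
c_order=$(cd "$dir" && LC_ALL=C ls)
printf '%s\n' "$c_order"

rm -r "$dir"
```

Running plain 'ls' in the same directory under a dictionary-ordered
locale would typically list "apple" first instead.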

Personally I have the following in my $HOME/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

That sets most of my locale to a UTF-8 one but forces sorting to be
standard C/POSIX.  This probably won't work in the general case since
I have no idea how that would interact with all character sets.  I
expect it will interact very badly with Big5 for example.  And I don't
know how it deals with other non-English character sets.  But it gives
me some relief.
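
One caveat worth sketching: LC_ALL, if it is set anywhere in the
environment, overrides LC_COLLATE, so the .bashrc lines above only
help when LC_ALL is unset.  The sample letters below are arbitrary:

```shell
#!/bin/sh
# The command substitution runs in a subshell, so the unset and the
# export below do not leak into the calling shell.
sorted=$(
  unset LC_ALL          # LC_ALL, if set, would override LC_COLLATE
  LC_COLLATE=C
  export LC_COLLATE
  printf '%s\n' b A a B | sort
)

# With collation forced to C, uppercase sorts before lowercase.
printf '%s\n' "$sorted"
```

The precedence is LC_ALL first, then the specific LC_* variable, then
LANG as the fallback.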

Bob
