
Re: egrep oddity



Thanks for this illuminating response to what I thought
might have been mere user naivete. 

On Sun, Feb 05, 2012 at 05:55:48PM -0700, Bob Proulx wrote:
> Neal Murphy wrote:
> > For quite some time now, I've been getting peeved with egrep not
> > doing what it should.
> 
> You don't like it and I don't like it but the powers that be have
> decided that within a locale, within libc, character collation
> sequences will be dictionary ordering where case is folded and
> punctuation is ignored.  They failed to see how this would negatively
> impact almost everything.  Creeping features.
> 
> And because punctuation is ignored it causes a lot of problems with
> utilities such as sort.  You didn't have to say LC_ALL=C for the first
> thirty years.  But you do now.  (Or at least since the mid-1990s.)
> In almost all scripts dealing with sort ordering you will find it
> necessary to set LC_ALL=C to get expected results.  I have been a
> rather outspoken critic of this design decision on other mailing
> lists.
> 
> > I have Squeeze installed and up-to-date. In an xterm running bash or on a
> > console running bash or dash, this command:
> >   ls -C1 | egrep "^[A-Z]"
> > returns all lines except those beginning with 'a'.
> 
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" with 'A'!

Holy rat piss! Understanding this is a lot to be asking
someone new (or even old) to a Unixlike environment.
 
I usually recommend that people learn Unix tools (it's fun!), but
this is exceptionally obtuse, if not downright unfriendly.
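
For anyone else reading along, here is a small model of the ordering Bob
describes. It is only a sketch: it builds the interleaved "aAbB...zZ"
sequence by hand instead of querying the locale, so it illustrates the
explanation above, not what any particular libc actually does. The
variable names (order, range_AZ, range_az) are mine.

```shell
lower=abcdefghijklmnopqrstuvwxyz
upper=ABCDEFGHIJKLMNOPQRSTUVWXYZ

# Build the dictionary-style sequence aAbBcC...zZ one letter pair at a time.
order=""
i=1
while [ "$i" -le 26 ]; do
  lo=$(printf '%s' "$lower" | cut -c"$i")
  up=$(printf '%s' "$upper" | cut -c"$i")
  order="$order$lo$up"
  i=$((i + 1))
done

# In this model, [A-Z] runs from 'A' to 'Z': everything except the leading 'a'.
range_AZ=${order#a}

# In this model, [a-z] runs from 'a' to 'z': everything except the trailing 'Z'.
range_az=${order%Z}

printf '[A-Z] covers: %s\n' "$range_AZ"
printf '[a-z] covers: %s\n' "$range_az"
```

Under that model, [A-Z] matches every letter but 'a', which is exactly
the symptom Neal reported.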

> In the en_US.UTF-8 locale (for example) what would traditionally have
> been [A-Z] and [a-z] now must be specified as [[:upper:]] and
> [[:lower:]] instead.
> 
> [An Aside: I do find [:space:] a useful character class meaning any of
> the whitespace characters.  It is POSIX compliant and can be cut and
> pasted nicely.  Of course Perl heads will want to use \s and \S.]
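
A minimal sketch of the two workarounds mentioned so far, assuming a
grep that supports -E and POSIX character classes. The sample names are
made up for illustration.

```shell
# Sample input: two lowercase-initial names and two uppercase-initial names.
names=$(printf '%s\n' alpha Bravo charlie Zulu)

# Workaround 1: force the C locale for this one command, so that [A-Z]
# means the ASCII letters A through Z and nothing else.
by_c_locale=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[A-Z]')

# Workaround 2: use the character class, which asks "is this uppercase?"
# and does not depend on the collation order at all.
by_class=$(printf '%s\n' "$names" | grep -E '^[[:upper:]]')

printf '%s\n' "$by_c_locale"
printf '%s\n' "$by_class"
```

Both commands should print only Bravo and Zulu. Note that the class name
goes inside a bracket expression, hence the doubled brackets.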

> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both GNU awk and GNU grep, IIRC;
> awk is in experimental now) to reverse this behavior in the userland
> code.  So what libc has put in will finally be reversed by
> applications voting with code to take it out.  The newest GNU awk is
> re-implementing ranges A-Z and a-z as you would expect.
> 
> Here is a reference:
> 
>     http://www.gnu.org/s/gawk/manual/html_node/Ranges-and-Locales.html
> 
> > Even the following commands exhibit similar behavior:
> >
> >   alias|sed -e 's/^a/b/'|egrep "^[A-Z]"  # passes sed's output untouched
> >   alias|sed -e 's/^a/A/'|egrep "^[A-Z]"  # passes sed's output untouched
> > 
> > These commands behave the same way on another Squeeze installation
> > at another location. Also, 'grep -E' behaves the same way.
> > 
> > The commands behave as expected on a different GNU/Linux system.
> > 
> > Does anyone else see this behavior? Or do I need to clean my pipe and smoke 
> > something better?
> 
> The character collation sequence affects almost everything on the
> system that sorts.  This includes commands such as 'ls' and also your
> shell (e.g. 'echo *') too.  Plus things like 'expr'.  Everything that
> uses libc strcoll(3) and that is pretty much everything.  It is a part
> of the conversion for a program to be multi-byte character aware.
> Programs were converted en masse in the 90's in order to support UTF-8
> and non-English locales.  Mostly that was good.  But for things like
> this I think it was quite bad.
> 
> Personally I have the following in my $HOME/.bashrc file.
> 
>   export LANG=en_US.UTF-8
>   export LC_COLLATE=C

Thanks, that's very useful.
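
To see what C collation does to sort order without editing .bashrc,
something like the following should work as a quick check. It is a
sketch with made-up input. I use LC_ALL=C on the one command instead of
LC_COLLATE because LC_ALL, when set, overrides LC_COLLATE.

```shell
# Mixed-case input with punctuation, the kind of data that sorts
# differently depending on the locale.
input=$(printf '%s\n' banana Apple _config cherry Zebra)

# LC_ALL=C overrides LANG and every LC_* variable for this one command,
# so lines are compared by byte value: uppercase letters first, then '_',
# then lowercase letters.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort)

printf '%s\n' "$sorted"
```

With byte-value ordering the result is Apple, Zebra, _config, banana,
cherry.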
 
> That sets most of my locale to a UTF-8 one but forces sorting to be
> standard C/POSIX.  This probably won't work in the general case since
> I have no idea how that would interact with all character sets.  I
> expect it will interact very badly with big5 for example.  And I don't
> know how it deals with other non-English character sets.  But it gives
> me some relief.
> 
> Bob



-- 
Joel Roth

