Re: egrep oddity
Thanks for this illuminating response to what I thought
might have been mere user naivete.
On Sun, Feb 05, 2012 at 05:55:48PM -0700, Bob Proulx wrote:
> Neal Murphy wrote:
> > For quite some time now, I've been getting peeved with egrep not
> > doing what it should.
>
> You don't like it and I don't like it but the powers that be have
> decided that within a locale, within libc, character collation
> sequences will be dictionary ordering where case is folded and
> punctuation is ignored. They failed to see how this would negatively
> impact almost everything. Creeping features.
>
> And because punctuation is ignored it causes a lot of problems with
> utilities such as sort. You didn't have to say LC_ALL=C for the first
> thirty years. But you do now. (Or at least since the mid 1990's.)
> In almost all scripts dealing with sort ordering you will find it
> necessary to set LC_ALL=C to get expected results. I have been a
> rather outspoken critic of this design decision on other mailing
> lists.
>
> > I have Squeeze installed and up-to-date. In an xterm running bash or on a
> > console running bash or dash, this command:
> > ls -C1 | egrep "^[A-Z]"
> > returns all lines except those beginning with 'a'.
>
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z". So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" with 'A'!
Holy rat piss! Understanding this is a lot to be asking
someone new (or even old) to a Unixlike environment.
I usually recommend that people learn Unix tools (it's fun!), but
this is exceptionally obtuse, if not downright unfriendly.
> In the en_US.UTF-8 locale (for example) what would traditionally have
> been [A-Z] and [a-z] now must be specified as [[:upper:]] and
> [[:lower:]] instead.
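For anyone who wants to try the two workarounds side by side, here is a minimal sketch. It assumes a POSIX shell and a grep that accepts -E; the sample names are made up for illustration.

```shell
#!/bin/sh
# Made-up sample input: one name per line, mixed case.
names='alpha
Bravo
charlie
Zulu'

# POSIX character class: selects lines starting with an uppercase letter,
# without depending on how the locale orders the range A-Z.
by_class=$(printf '%s\n' "$names" | grep -E '^[[:upper:]]')

# Traditional range, with the C locale forced for this one command only,
# so that A-Z means the 26 ASCII capitals.
by_range=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[A-Z]')

printf 'class: %s\n' $by_class
printf 'range: %s\n' $by_range
```

Both variants should select only Bravo and Zulu. Setting LC_ALL=C as a prefix affects just that one grep invocation, not the rest of the session.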
>
> [An Aside: I do find [:space:] a useful character class meaning any of
> the whitespace characters. It is POSIX compliant and can be cut and
> pasted nicely. Of course perl heads will want to use \s and \S.]
>
> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both GNU awk and GNU grep IIRC,
> awk is in experimental now) to reverse this behavior in the userland
> code. So what libc has put in will finally be reversed by
> applications voting with code to take it out. The newest GNU awk is
> re-implementing ranges A-Z and a-z as you would expect.
>
> Here is a reference:
>
> http://www.gnu.org/s/gawk/manual/html_node/Ranges-and-Locales.html
>
> > Even the following commands exhibit similar behavior:
> >
> > alias|sed -e 's/^a/b/'|egrep "^[A-Z]" # passes sed's output untouched
> > alias|sed -e 's/^a/A/'|egrep "^[A-Z]" # passes sed's output untouched
> >
> > These commands behave the same way on another Squeeze installation
> > at another location. Also, 'grep -E' behaves the same way.
> >
> > The commands behave as expected on a different GNU/Linux system.
> >
> > Does anyone else see this behavior? Or do I need to clean my pipe and smoke
> > something better?
>
> The character collation sequence affects almost everything on the
> system that sorts. This includes commands such as 'ls' and also your
> shell (e.g. 'echo *') too. Plus things like 'expr'. Everything that
> uses libc strcoll(3) and that is pretty much everything. It is a part
> of the conversion for a program to be multi-byte character aware.
> Programs were converted en masse in the 90's in order to support UTF-8
> and non-English locales. Mostly that was good. But for things like
> this I think it was quite bad.
>
> Personally I have the following in my $HOME/.bashrc file.
>
> export LANG=en_US.UTF-8
> export LC_COLLATE=C
Thanks, that's very useful.
> That sets most of my locale to a UTF-8 one but forces sorting to be
> standard C/POSIX. This probably won't work in the general case since
> I have no idea how that would interact with all character sets. I
> expect it will interact very badly with big5 for example. And I don't
> know how it deals with other non-English character sets. But it gives
> me some relief.
>
> Bob
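A quick way to check what LC_COLLATE=C does to sorting, as a minimal sketch assuming a POSIX shell and sort. The sample letters are made up. One caveat: LC_ALL takes precedence over LC_COLLATE, so the sketch unsets LC_ALL inside a subshell before relying on LC_COLLATE.

```shell
#!/bin/sh
# Sort four made-up lines with collation forced to C.
# The subshell keeps the unset from leaking into the calling shell.
sorted=$(printf '%s\n' b A a B | (unset LC_ALL; LC_COLLATE=C sort))

# In the C locale, sort compares byte values, so every ASCII capital
# sorts ahead of every lowercase letter.
printf '%s\n' "$sorted"
```

With C collation this should print A, B, a, b, one per line. If LC_ALL is exported somewhere in the environment (some login scripts do this), the LC_COLLATE line in .bashrc will have no effect until LC_ALL is unset.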
--
Joel Roth