Re: egrep oddity
On 2012-02-05 17:55:48 -0700, Bob Proulx wrote:
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z". So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" with 'A'!
This is not what I observe (though I was expecting this behavior)
on Debian/unstable. Is it a bug?
xvii% export LC_ALL=en_US.utf8
xvii% locale
LANG=POSIX
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
xvii% echo BC | grep '[a-z]'
xvii% echo BC | grep '[A-z]'
grep: Invalid range end
xvii% echo BC | LC_ALL=C grep '[A-z]'
BC
The test with '[A-z]' shows that something happens with the collating
rules, but then I would have expected
echo BC | grep '[a-z]'
to output BC. At least "sort" seems to behave as expected:
xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort
AB
BC
CD
ab
bc
cd
xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=en_US.utf8 sort
ab
AB
bc
BC
cd
CD
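A quick way to see where a character falls relative to a range's endpoints is to sort the endpoints and the character together. This is only a sketch, not part of the session above; the variable name order_c is made up for illustration, and only the C-locale result is stated because it does not depend on which locales are installed.

```shell
# Sort the range endpoints 'a' and 'z' together with the candidate 'B'.
# In the C locale the order is by byte value, so 'B' (0x42) comes before 'a' (0x61).
order_c=$(printf '%s\n' a B z | LC_ALL=C sort | tr -d '\n')
echo "C locale order: $order_c"    # prints "C locale order: Baz"

# To compare with another locale (assuming en_US.utf8 is installed), run:
#   printf '%s\n' a B z | LC_ALL=en_US.utf8 sort
```

If 'B' lands between 'a' and 'z' in the second command, a collation-based [a-z] would be expected to match it.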
> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both gnu awk and gnu grep IIRC,
> awk is in experimental now) to reverse this behavior in the userland
> code.
Perhaps this explains what I'm seeing with grep, except for [A-z].
But the grep man page still says:
Within a bracket expression, a range expression consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, inclusive, using the locale's
collating sequence and character set. For example, in the default C
locale, [a-d] is equivalent to [abcd]. Many locales sort characters in
dictionary order, and in these locales [a-d] is typically not
equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
To obtain the traditional interpretation of bracket expressions, you
can use the C locale by setting the LC_ALL environment variable to the
value C.
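The two locale-independent spellings can be sketched as follows: forcing the C locale for a range, or using POSIX character classes instead of ranges. This is a minimal illustration under the assumption of a POSIX grep and ASCII input; the variable names are hypothetical and not from the session above.

```shell
# Count matching lines (grep -c); "|| true" keeps a zero-match result
# from being treated as a failure under "set -e".

# Ranges under LC_ALL=C follow byte order, so [a-z] is lowercase ASCII only.
lower_range_c=$(echo BC | LC_ALL=C grep -c '[a-z]' || true)
upper_range_c=$(echo BC | LC_ALL=C grep -c '[A-Z]' || true)

# POSIX character classes do not rely on the collating sequence.
lower_class=$(echo BC | grep -c '[[:lower:]]' || true)
upper_class=$(echo BC | grep -c '[[:upper:]]' || true)

echo "C [a-z]: $lower_range_c, C [A-Z]: $upper_range_c"            # 0, 1
echo "[[:lower:]]: $lower_class, [[:upper:]]: $upper_class"        # 0, 1
```

The character-class form has the advantage of not changing the locale for the rest of the command, which matters when the input contains non-ASCII text.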
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)