
Re: grep / sed + regex : possible bug ?



On Thu, Mar 20, 2003 at 03:36:28AM +0100, Axel Schlicht wrote:
> But, as grep and sed only operate on line levels (although sed can work
> on multiple lines with some tweaking) the $ should play no role here
> So
> grep  '/Name/[^/][^/]*'
> should mean
> find a line with /Name/ anywhere
> then find at least one (first [^/]) character <> '/'
> then find no more or any number of characters <> '/'
> but once you find a '/' the mission is over and you have to drop that
> line

No, you misunderstand. Your regular expression matches "/Name/" followed
by one or more non-/ characters. It says nothing about what is allowed
to follow those non-/ characters. If you want the non-/ characters to
extend until the end of the line - that is, you want no / characters
until the end of the line - you *must* anchor the regular expression
using a final $.
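
As a rough illustration of the difference, here is a small sketch that
runs your pattern with and without the final $. The three sample lines
are made up for the example, and it assumes nothing beyond a POSIX grep:

```shell
# Made-up sample input, one path per line.
input='blaba/Name/aaa
blaba/Name/aaa/1
blaba/Name/'

# Unanchored: any line containing /Name/ followed by at least one
# non-/ character is reported, whatever comes after that.
unanchored=$(printf '%s\n' "$input" | grep '/Name/[^/][^/]*')

# Anchored: the run of non-/ characters must reach the end of the line.
anchored=$(printf '%s\n' "$input" | grep '/Name/[^/][^/]*$')

printf 'unanchored:\n%s\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"
```

The unanchored run reports the first two lines; the anchored run reports
only "blaba/Name/aaa".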

> Now for the false matches
> blaba/Name/aaa/1 : 
> blaba/Name/     : possible hit : state : valid
> blaba/Name/a    : no '/', : state still valid : let's go on
> blaba/Name/aa   : no '/', : state still valid : let's go on
> blaba/Name/aaa  : no '/', : state still valid : let's go on
> blaba/Name/aaa/ : '/' read : state invalid : let's get out of here :

You've definitely misunderstood how unanchored regular expressions work.
In general, tools that handle regular expressions do *not* require them
to match all the input, so your "state invalid" actually means "we ran
off the end of the regular expression before we ran out of input, but
that's OK".

In other words, when your regex is applied to "blaba/Name/aaa/", it
successfully matches the "blaba/Name/aaa" portion of the input, and
since you have placed no constraint on it to match the entire line it
feels no obligation to worry about the trailing /. In sed, you'll find
that the special character & on the replacement side of an s/// command
refers to "that portion of the pattern space which matched" (from the
sed(1) man page), clearly implying that the entire pattern space does
not necessarily have to match.
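
You can see this for yourself with a quick sketch that uses & to put
brackets around whatever was matched. It uses the pattern exactly as
quoted above (no leading .*), and the input line is just your own
example:

```shell
# Wrap the matched portion of the pattern space in brackets.
# "|" is used as the s command delimiter so the slashes in the
# pattern need no escaping.
marked=$(printf '%s\n' 'blaba/Name/aaa/' | sed 's|/Name/[^/][^/]*|[&]|')

printf '%s\n' "$marked"   # prints blaba[/Name/aaa]/
```

The trailing / sits outside the brackets: it was never part of the
match, and nothing in the pattern required it to be.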

> So why does grep / sed report them, although they violate the limits of
> the regex, as is:
> not '/' after /Name/, period.

Because those tools, like almost all others, are content to match
regexes against substrings of the input. Your regular expression does
limit what it matches, but not in the way you think it does.

> There should only be one possible explanation
> the preceding .* may go haywire and read up to the end of the line
> before thinking on matching anything else, but although greedy, sed and
> the like should only be greedy up to a point, that is .*Anything will be
> interpreted as read as much as you like, but once you meet an Anything
> you'll stop.

That is also incorrect, and in fact completely misses the point of
greedy quantifiers. /.*Anything/ applied to "fooAnythingbarAnything"
matches the entire string, not just "fooAnything". If you're using
Perl-style regexes you can write /.*?Anything/ instead, which makes the
quantifier non-greedy so that it matches only "fooAnything".
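
A minimal sketch of the greedy case, again using & to mark the match.
POSIX sed has no non-greedy quantifier, so only the greedy behaviour is
shown here:

```shell
# .* is greedy: it grabs as much as it can while still letting
# "Anything" match, so the match ends at the LAST "Anything".
greedy=$(printf '%s\n' 'fooAnythingbarAnything' | sed 's/.*Anything/[&]/')

printf '%s\n' "$greedy"   # prints [fooAnythingbarAnything]
```

If .* stopped at the first "Anything", the output would have been
"[fooAnything]barAnything" instead.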

Cheers,

-- 
Colin Watson                                  [cjwatson@flatline.org.uk]


