
Re: grep / sed + regex : possible bug ?



Hi Clive
Thanks for the answer.
> No, it's not a bug. Regular expressions match substrings, not entire lines, unless constrained by anchors (^ or $) [1].
Of course.
But since grep and sed operate line by line (although sed can work on
multiple lines with some tweaking), the $ should play no role here.
So
grep  '/Name/[^/][^/]*'
should mean
find a line with /Name/ anywhere
then find at least one (first [^/]) character <> '/'
then find zero or more further characters <> '/'
but once you find a '/' the mission is over and you have to drop that
line
So it should match
blaba/Name/aaa
blaba/Name/     : possible hit : state : valid
blaba/Name/a    : no '/', : state still valid : let's go on
blaba/Name/aa   : no '/', : state still valid : let's go on
blaba/Name/aaa  : no '/', : state still valid : let's go on
{} / EOL        : no more input, state valid, let's report a match
The same should apply here
  blaba/Name/bb
  blaba/Name/Cc&DD
With the anchor ('$') it should include another step
blaba/Name/     : possible hit : state : valid
blaba/Name/a    : no '/', : state still valid : let's go on
blaba/Name/aa   : no '/', : state still valid : let's go on
blaba/Name/aaa  : no '/', : state still valid : let's go on
EOL             : no '/', EOL found : maximal amount of input allowed read :
                  state valid : stop : let's report a match

Same, of course for
  blaba/Name/bb
  blaba/Name/Cc&DD

Now for the false matches
blaba/Name/aaa/1 : 
blaba/Name/     : possible hit : state : valid
blaba/Name/a    : no '/', : state still valid : let's go on
blaba/Name/aa   : no '/', : state still valid : let's go on
blaba/Name/aaa  : no '/', : state still valid : let's go on
blaba/Name/aaa/ : '/' read : state invalid : let's get out of here :
                  report error
So, for grep / sed an error means don't tell, so the output should be blank
Thus
blaba/Name/aaa/1
blaba/name/aaa/2
blaba/Name/bb/3
blaba/Name/Ccccc/5
blaba/Name/Cc&DD/2
must not be reported, no matter whether there is an anchor or not.
The moment they (the progs) encounter a '/' the DFA (Deterministic
Finite automaton) should switch to an invalid state and that's it for
that line. Dump it, let's not talk about it, forget it, you're out,
history, dead.
(Of course, for the program there might be a chance that a little farther
down the line it might encounter a second /Name/, so it should go on,
realize there is no such thing and quietly give up.)
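
To make the test case concrete, here is a minimal sketch for reproducing it.
The variable names and feeding the lines in via printf are just my choices
for illustration; the lowercase 'name' line is kept exactly as in the list
above. As far as I can tell, the first command selects every /Name/ line,
including those with a further '/', while the second selects only the three
lines that have no '/' after the name.

```shell
# Test lines from this mail: three expected hits, five expected misses.
# (The lowercase "name" line cannot match either pattern, because grep
# is case-sensitive by default.)
lines=$(printf '%s\n' \
  'blaba/Name/aaa' \
  'blaba/Name/bb' \
  'blaba/Name/Cc&DD' \
  'blaba/Name/aaa/1' \
  'blaba/name/aaa/2' \
  'blaba/Name/bb/3' \
  'blaba/Name/Ccccc/5' \
  'blaba/Name/Cc&DD/2')

# Without the anchor
unanchored=$(printf '%s\n' "$lines" | grep '/Name/[^/][^/]*')

# With the anchor
anchored=$(printf '%s\n' "$lines" | grep '/Name/[^/][^/]*$')

printf 'without anchor:\n%s\n\nwith anchor:\n%s\n' "$unanchored" "$anchored"
```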

So why do grep / sed report them, although they violate the constraint of
the regex, that is:
no '/' after /Name/, period.

> Without the $ anchor, the [^/]* matches as many non-/ characters as it can, and no more. The next / and any subsequent characters are ignored.
Yes, but after reading a '/' after /Name/ they cannot read more chars, so
they have to bail out.
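
One way to see which part of such a line counts as the match is grep's -o
option, which prints only the matched text. This is a small sketch that
assumes a grep with -o (GNU and BSD grep have it; it is not part of POSIX).

```shell
# -o prints only the matched part of each selected line
# (assumption: GNU or BSD grep; POSIX does not require -o)
matched=$(printf '%s\n' 'blaba/Name/aaa/1' | grep -o '/Name/[^/][^/]*')
printf '%s\n' "$matched"
```

Here the reported match is /Name/aaa: it ends just before the second '/',
and the rest of the line is not part of the match.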

> However when followed by the $ anchor, the [^/]* must match non-/ characters all the way to the end of the line. This looks like a correct solution.
Sure, but once they have found a character violating the condition (no '/'
after /Name/) they are expected to give up.
 
> Does it look any less strange now?
Honest answer, no.

There should only be one possible explanation:
the preceding .* may go haywire and read up to the end of the line
before thinking of matching anything else. But although greedy, sed and
the like should only be greedy up to a point; that is, .*Anything will be
interpreted as: read as much as you like, but once you meet an Anything
you'll stop.

But a .* prefix does not change anything (quite correctly)
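
For completeness, a sketch of that comparison. The three sample lines are
just illustrative picks from the cases above, and the variable names are
mine. Both patterns select the same lines, so the counts come out equal.

```shell
# Three of the test lines from above (illustrative selection)
lines=$(printf '%s\n' \
  'blaba/Name/aaa' \
  'blaba/Name/aaa/1' \
  'blaba/Name/bb/3')

# grep -c prints the number of selected lines
plain=$(printf '%s\n' "$lines" | grep -c '/Name/[^/][^/]*')
prefixed=$(printf '%s\n' "$lines" | grep -c '.*/Name/[^/][^/]*')

printf 'without .*: %s\nwith .*:    %s\n' "$plain" "$prefixed"
```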

So the question remains, why don't they stop once they meet the first
'/' after /Name/?

Axel Schlicht


