
Re: grep replacement using sed is behaving oddly



On Fri, Oct 21, 2022 at 02:15:01PM -0400, Greg Wooledge wrote:
> On Fri, Oct 21, 2022 at 08:01:00PM +0200, tomas@tuxteam.de wrote:
> > On Fri, Oct 21, 2022 at 01:21:44PM -0400, Gary Dale wrote:
> > > I'm hoping someone can tell me what I'm doing wrong. I have a line in a lot
> > > of HTML files that I'd like to remove. The line is:
> > > 
> > >             <hr  style="border-top: 1px solid rgb(0, 32, 159); margin:
> > > 0rem;">
> > > 
> > > I'm testing the sed command to remove it on just one file. When it works,
> > > I'll run it against *.html. My command is:
> > > 
> > >  sed -i -s 's/\s*\<hr\ \ style.*\>//g' history.html
> > > 
> > > Unfortunately, the replacement doesn't remove the line but rather leaves me
> > > with:
> > > 
> > >             <;">
> > 
> > This looks as if the <> in the regexp were interpreted as left and right
> > word boundaries (but that would only be the case if you'd given the -E
> > (or -r) option).
> > 
> > Try explicitly adding the --posix option, perhaps...
> 
> Gary is using non-POSIX syntax (specifically the \s), so that's not going
> to help unless he first changes his regular expression to be standard.

Yes, but he's telling sed to use POSIX basic regexps, aka "obsolete" in the
jargon of regex(7), by not overriding the default (which is POSIX/obsolete).
Unless something else is at work (@Gary: does "which sed" say /bin/sed?)

> I think you might be on to something with the \< and \> here.  I can see
> absolutely no reason why Gary put backslashes in front of spaces and
> angle brackets here.

They shouldn't do anything for spaces, since they are ordinary characters.
But HEY! I got that the wrong way around: escaping the <> makes them special:
Gary -- take away the backslashes from the angle brackets. That should help.
And as Greg says -- also from the spaces, that should unobfuscate your
regexp a bit.

All that said, deleting the whole line with sed is what you want anyway, as
noted by Greg elsewhere in the thread.
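
A minimal sketch of that, assuming GNU sed and assuming the tag always sits
on one line exactly as Gary quoted it. The temporary file and its contents
are made up for illustration and stand in for history.html:

```shell
# Hypothetical sample file standing in for history.html.
tmp=$(mktemp)
printf '%s\n' \
  '<p>before</p>' \
  '            <hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">' \
  '<p>after</p>' > "$tmp"

# No backslashes before the spaces or the angle brackets: they are all
# ordinary characters here. The d command deletes every matching line.
result=$(sed '/^[[:space:]]*<hr  style=.*>[[:space:]]*$/d' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Once the output looks right, adding -i would edit the file in place; trying
it without -i first keeps the original untouched.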

> The backslashes in front of the spaces are probably
> just noise, and can be ignored.  The \< and \> on the other hand might
> be interpreted as something special, the same way \s is (because this is
> GNU sed, which loves to do nonstandard things).

No, you are absolutely correct, my mind had a twist. In GNU sed, \< and \>
are the word boundary matches, with or without -E; a bare < or > is an
ordinary character either way.

> 
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/<.*>//'
> abc  xyz
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/\<.*\>//'
> 
> unicorn:~$ 
> 
> So... yeah, \< and/or \> clearly have some special meaning to GNU sed.
> Good luck figuring out what that is.

Word boundaries: the zero-width string between the last non-word character
and the first word character ("\<") and the one between the last word
character and the following non-word character ("\>"). PCRE has word
boundaries, too, written \b for both cases.
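
A small sketch of how that plays out, assuming GNU sed (\<, \> and \s are
GNU extensions). The variable names and the shortened indentation are only
for illustration; the last command is Gary's own substitution:

```shell
# \< matches at the start of a word, \> at the end of one, so only the
# standalone "foo" is replaced, not the "foo" inside "foobar".
whole=$(echo 'foobar foo' | sed 's/\<foo\>/X/g')
echo "$whole"

# Greg's example: \< anchors before "abc", .* runs greedily, and \> lands
# after "xyz", so the entire line is replaced.
greedy=$(echo 'abc <foo> xyz' | sed 's/\<.*\>//')
echo "[$greedy]"

# Gary's command: the match can only start at the word "hr" (after the "<")
# and ends at the last end-of-word, right after "0rem".
gary=$(echo '  <hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">' |
  sed 's/\s*\<hr\ \ style.*\>//g')
echo "[$gary]"
```

The "<" before hr and the ';">' after 0rem lie outside the match, which is
where the leftover comes from.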

> For Gary's actual problem, simply removing the backslashes where they're
> not wanted would be a good start.  Actually learning sed could be step 2.

Exactly.

> I feel obliged at this point to mention that parsing HTML with regular
> expressions is a fool's errand, and that sed should not be the tool of
> choice here.  Nor should grep, nor any other RE-based tool.  This goes
> triple when one doesn't even know the correct syntax for their RE.

Definitely. As far as HTML is concerned, Gary's line

  <hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">

is totally equivalent to

  <hr
   style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;"
  >

(the one restriction is that there must be no whitespace between the < and
the hr).

Some day the monster generating the HTML becomes creative, and the
debugging session is interesting. I guess Gary knows that.
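
To illustrate why a line-oriented tool is fragile here, a quick sketch,
assuming GNU sed. The shortened style attribute and the variable names are
made up for the example:

```shell
# The same tag, once on a single line and once spread over three lines.
one_line='<hr  style="margin: 0rem;">'
multi_line=$(printf '%s\n' '<hr' ' style="margin: 0rem;"' '>')

# sed looks at one line at a time, so the pattern only ever sees the
# single-line form.
left_one=$(printf '%s\n' "$one_line" | sed '/<hr  style=.*>/d')
left_multi=$(printf '%s\n' "$multi_line" | sed '/<hr  style=.*>/d')

echo "single-line leftover: [$left_one]"
echo "multi-line leftover:"
printf '%s\n' "$left_multi"
```

The single-line tag is deleted, while the three-line variant passes through
untouched even though it is the same element as far as HTML is concerned.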

(I've got a nice anecdote about processing of XML line by line with
Perl and funny stuff, but I won't bore you with that :-)

Cheers
-- 
t
