
remove an HTML tag and all its children from commandline



Hello. I believe this is a common case that must have been discussed
before on other forums, such as the awk, sed, or regular expression
groups. However, I could not find anything by searching. A pointer to
an existing solution would already help me a lot.

I want to remove all advertisements from my 100 HTML files. They are
neatly marked with a class, like the following:

<div class="advertisement">
...
</div>

However, I cannot simply do this:
s|<div class="advertisement">.*</div>||

That is too greedy: it matches up to the last "</div>", which is
almost always after the end of the advertisement.

If I make the match non-greedy, it also fails, because it stops at the
first </div> inside the advertisement.

Consider this case, where both greedy and non-greedy matching fail:

<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>

Greedy output:

    <div class="page-content">

Non-greedy output:

    <div class="page-content">
        <div>Contact us now!</div>
      </div>
    </div>


Expected output:

    <div class="page-content">
    </div>
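
For anyone who wants to reproduce the two failures, here is a minimal
sketch in Python (my choice for illustration only; the names PAGE, OPEN,
greedy and lazy are made up). It assumes the whole file is matched as a
single string, so re.DOTALL is used to let "." cross newlines. The
whitespace left behind differs slightly from the hand-written outputs
above, but the surviving tags are the same.

```python
import re

# The sample page from this message.
PAGE = """<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>
"""

OPEN = r'<div class="advertisement">'

# Greedy: ".*" runs to the LAST </div>, swallowing the page's own closing tag.
greedy = re.sub(OPEN + r'.*</div>', '', PAGE, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </div>, leaving half the ad behind.
lazy = re.sub(OPEN + r'.*?</div>', '', PAGE, flags=re.DOTALL)

print(greedy)
print(lazy)
```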

The only way to get this right seems to be to give the replace/remove
expression the ability to "count" the <div and </div> tags it
encounters. I could program such a thing in C, thanks to my college
education, but that sounds like overkill for such a common task. What
would you do in this case?
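
To make the counting idea concrete, here is a rough sketch of what I
have in mind, written in Python rather than C to keep it short. It is
only an illustration under assumptions, not a tested tool: strip_ads,
DIV_TAG and AD_OPEN are names I made up, the class attribute is assumed
to be written exactly as class="advertisement", and the markup is
assumed to be well nested, with no div-like text inside comments or
scripts and no self-closing <div/>.

```python
import re

# Any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r'<(/?)div\b[^>]*>', re.IGNORECASE)
# The opening tag of an advertisement block (assumed attribute spelling).
AD_OPEN = re.compile(r'<div\b[^>]*\bclass="advertisement"[^>]*>', re.IGNORECASE)

def strip_ads(html):
    """Remove each advertisement div together with all of its children."""
    out, pos = [], 0
    while True:
        ad = AD_OPEN.search(html, pos)
        if ad is None:
            break
        # Count nesting: +1 for every <div ...>, -1 for every </div>.
        depth, end = 1, ad.end()
        while depth:
            tag = DIV_TAG.search(html, end)
            if tag is None:
                # Unbalanced markup: leave the rest of the text untouched.
                out.append(html[pos:])
                return ''.join(out)
            depth += -1 if tag.group(1) else 1
            end = tag.end()
        out.append(html[pos:ad.start()])  # keep the text before the ad
        pos = end                         # skip the ad and its children
    out.append(html[pos:])
    return ''.join(out)

PAGE = """<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>
"""

print(strip_ads(PAGE))
```

On the sample page this keeps only the page-content div, plus the
whitespace that surrounded the advertisement. For command-line use the
function could be fed from sys.stdin and written to sys.stdout, one file
at a time.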

