Remove an HTML tag and all its children from the command line
Hello. I believe this is a common case and must have been discussed
before on other forums, such as the awk, sed or regular-expression groups.
However, I could not find anything on Google. It would help me a lot if
you could simply point me to a reference to a solution.
I want to remove all advertisements in my 100 html files. They are
pretty neatly classed, like the following:
<div class="advertisement">
...
</div>
However, I cannot simply do this:
s/<div class="advertisement">.*<\/div>//
That pattern is too greedy: it matches up to the last "</div>" in the
input, which is almost always after the advertisement.
If I make it non-greedy, it also fails, because it stops at the
first </div> inside the advertisement.
Consider this case, where both the greedy and the non-greedy match fail:
<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
</div>
Greedy output:
<div class="page-content">
Non-greedy output:
<div class="page-content">
<div>Contact us now!</div>
</div>
</div>
Expected output:
<div class="page-content">
</div>
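The two failures can be reproduced with Python's re module. This is only an illustration: PAGE and the other names are made up for the demonstration, and re.DOTALL is assumed so that "." can cross line boundaries (sed, which works line by line, would behave differently).

```python
import re

# Hypothetical sample input, copied from the example in this post.
PAGE = """<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
</div>
"""

# re.DOTALL lets "." match newlines, so one match can span several lines.
GREEDY = re.compile(r'<div class="advertisement">.*</div>', re.DOTALL)
NON_GREEDY = re.compile(r'<div class="advertisement">.*?</div>', re.DOTALL)

# Greedy: runs to the LAST </div>, so the page-content closer is lost too.
greedy_result = GREEDY.sub("", PAGE)

# Non-greedy: stops at the FIRST </div>, leaving half the advertisement.
non_greedy_result = NON_GREEDY.sub("", PAGE)

print(greedy_result)
print(non_greedy_result)
```

Neither pattern can know which </div> closes the advertisement, because a plain regular expression does not track nesting depth.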
The only way to get this right seems to be to give the replace/remove
expression the ability to "count" the <div and </div> tags it
encounters. I could program such a thing in C, thanks to my college
education, but that sounds like overkill for such a common task. What
would you do in this case?
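For reference, a minimal sketch of that counting idea, in Python rather than C, might look like the following. It is not a real HTML parser. It assumes the markup is well formed, that no <div> tags hide inside comments or scripts, and that the advertisement block is always opened with class="advertisement" written exactly that way. The names strip_ads, DIV_TAG and AD_OPEN are invented for this example.

```python
import re

# Any opening or closing <div> tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r'<(/?)div\b[^>]*>', re.IGNORECASE)
# An opening <div> whose attributes include class="advertisement".
AD_OPEN = re.compile(r'<div\b[^>]*\bclass="advertisement"[^>]*>', re.IGNORECASE)


def strip_ads(html: str) -> str:
    """Remove every advertisement <div> together with all of its children."""
    out = []
    pos = 0    # start of the text that has not been copied to the output yet
    depth = 0  # number of <div> levels currently open inside an advertisement
    for m in DIV_TAG.finditer(html):
        is_close = m.group(1) == "/"
        if depth == 0:
            if not is_close and AD_OPEN.fullmatch(m.group(0)):
                # Keep everything before the advertisement, then start counting.
                out.append(html[pos:m.start()])
                depth = 1
        elif is_close:
            depth -= 1
            if depth == 0:
                # The advertisement just closed; resume copying after this tag.
                pos = m.end()
        else:
            depth += 1  # a nested <div> inside the advertisement
    if depth == 0:
        out.append(html[pos:])  # copy the tail after the last advertisement
    # If depth > 0 the advertisement was never closed; the tail is dropped.
    return "".join(out)


sample = """<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
</div>
"""
cleaned = strip_ads(sample)
print(cleaned)
```

On the sample above this keeps only the page-content wrapper, with a blank line left where the advertisement used to be. To process the 100 files, each one could be read, passed through strip_ads and written back. For anything less tidy than this markup, a proper HTML parser would be the safer choice.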