remove an HTML tag and all its children from commandline

To: debian-user@lists.debian.org
Subject: remove an HTML tag and all its children from commandline
From: Zhang Weiwu <zhangweiwu@realss.com>
Date: Sun, 31 Jan 2010 10:54:46 +0800
Message-id: <4B64F0F6.3070500@realss.com>

Hello. I believe this is a common case that must have been discussed
before on other forums, such as the awk, sed or regular-expression
groups. However, I could not find those discussions with Google. You
would help me a lot by simply pointing me to a reference for a solution.

I want to remove all advertisements from my 100 HTML files. They are
marked up consistently, like the following:

<div class="advertisement">
...
</div>

However, I cannot simply do this:
s|<div class="advertisement">.*</div>||

Because it is too greedy: it matches up to the last "</div>", which is
almost always well after the end of the advertisement.

If I make the match non-greedy, it also fails, because it stops at the
first </div> inside the advertisement.

Consider this case, where both greedy and non-greedy matching fail:

<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>

Greedy output:

    <div class="page-content">

Non-greedy output:

    <div class="page-content">
        <div>Contact us now!</div>
      </div>
    </div>


Expected output:

    <div class="page-content">
    </div>
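
For anyone who wants to reproduce the two failures, here is a minimal
sketch using Python's re module rather than sed. It assumes the whole
file is treated as a single string (hence re.DOTALL, so that "." also
matches newlines); the names SAMPLE, OPEN, greedy and lazy are only
illustrative.

```python
import re

# The sample page from this message, held as one multi-line string.
SAMPLE = """<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>
"""

OPEN = r'<div class="advertisement">'

# Greedy: ".*" runs to the LAST </div> in the input.
greedy = re.sub(OPEN + r".*</div>", "", SAMPLE, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </div> after the opening tag.
lazy = re.sub(OPEN + r".*?</div>", "", SAMPLE, flags=re.DOTALL)

print(greedy)
print(lazy)
```

The greedy result loses the closing tag of page-content, and the
non-greedy result still contains the second inner <div> plus an
unmatched </div>.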

The only way to get this right seems to be to give the removal
expression the ability to "count" the <div and </div> tags it
encounters. I could program such a thing in C thanks to my college
education, but that sounds like overkill for such a common task. What
would you do in this case?
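
To make the "counting" idea concrete, here is a rough sketch in Python
(not C). It is only one possible approach under stated assumptions: the
opening tag is written exactly as <div class="advertisement">, every
<div> is properly closed, and no "<div" text hides inside comments or
scripts. The names DIV_TAG, AD_OPEN and strip_ads are made up for this
illustration.

```python
import re

# Any opening <div ...> tag or closing </div> tag.
DIV_TAG = re.compile(r"<div\b[^>]*>|</div\s*>", re.IGNORECASE)
# Opening tag of the block to delete (exact attribute form is assumed).
AD_OPEN = re.compile(r'<div\s+class="advertisement"\s*>', re.IGNORECASE)


def strip_ads(html):
    """Remove each advertisement <div> together with all its children."""
    out = []
    pos = 0
    while True:
        opening = AD_OPEN.search(html, pos)
        if opening is None:
            out.append(html[pos:])  # no more advertisements
            return "".join(out)
        depth = 1  # we are inside the advertisement <div>
        end = None
        for tag in DIV_TAG.finditer(html, opening.end()):
            # +1 for every nested <div>, -1 for every </div>.
            depth += -1 if tag.group().startswith("</") else 1
            if depth == 0:  # this </div> closes the advertisement
                end = tag.end()
                break
        if end is None:
            # Unbalanced markup: leave the remainder untouched.
            out.append(html[pos:])
            return "".join(out)
        out.append(html[pos:opening.start()])
        pos = end


SAMPLE = """<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>
"""

print(strip_ads(SAMPLE))
```

On the sample above this leaves only the page-content <div> and its
closing tag, with some leftover whitespace between them. Applying it to
many files would mean reading each file, calling strip_ads, and writing
the result back. A real HTML parser would be more robust than this tag
counting.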

