
Re: remove an HTML tag and all its children from commandline



On Sun, 31 Jan 2010 10:54:46 +0800
Zhang Weiwu <zhangweiwu@realss.com> wrote:

...

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
> 
> <div class="advertisement">
> ...
> </div>
> 
> However I could not simply do this:
> s/<div class="advertisement">.*</div>//
> 
> Because it is too greedy: it matches up to the last "</div>" in the
> file, which is almost always well after the advertisement.
> 
> If I make it non-greedy, it also fails, because it stops at the
> first </div> inside the advertisement.

...

> The only way to make it right seems to be able to give the replacement /
> remove expression the ability to "count" the number of <div and </div>
> it encounters. I could program such thing in C thanks to my college
> education, but it sounds overkill for such a common task. What would you
> do in this case?
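
To make the failure described above concrete, here is a small sketch in Python. The sample page and the variable names are made up for illustration; only the two patterns mirror the substitution quoted above.

```python
import re

# Hypothetical sample page: an advertisement div containing a nested div,
# followed by a div with real content.
page = (
    '<div class="advertisement"><div>buy now</div> more ad</div>'
    '<div class="content">real text</div>'
)

# Greedy: ".*" runs to the LAST </div> in the input.
pattern_greedy = re.compile(r'<div class="advertisement">.*</div>', re.DOTALL)
# Non-greedy: ".*?" stops at the FIRST </div>, which closes the nested div.
pattern_lazy = re.compile(r'<div class="advertisement">.*?</div>', re.DOTALL)

greedy_result = pattern_greedy.sub('', page)
lazy_result = pattern_lazy.sub('', page)

print(repr(greedy_result))  # the real content is removed along with the ad
print(repr(lazy_result))    # part of the ad and a stray </div> are left behind
```

Neither pattern can tell which </div> closes the advertisement, because neither keeps track of nesting depth.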

"Among programmers of any experience, it is generally regarded as A Bad
Idea(tm) to attempt to parse HTML with regular expressions. How bad of an
idea? It apparently drove one Stack Overflow user to the brink of
madness:

"You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML. As
I have answered in HTML-and-regex questions here so many times before,
the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to
understand the constructs employed by HTML. HTML is not a regular
language and hence cannot be parsed by regular expressions. Regex
queries are not equipped to break down HTML into its meaningful parts.
so many times but it is not getting to me. Even enhanced irregular
regular expressions as used by Perl are not up to the task of parsing
HTML. You will never make me crack. HTML is a language of sufficient
complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions. Every time
you attempt to parse HTML with regular expressions, the unholy child
weeps the blood of virgins, and Russian hackers pwn your webapp.
Parsing HTML with regex summons tainted souls into the realm of the
living. HTML and regex go together like love, marriage, and ritual
infanticide. The <center> cannot hold it is too late. The force of
regex and HTML together in the same conceptual space will destroy your
mind like so much watery putty. If you parse HTML with regex you are
giving in to Them and their blasphemous ways which doom us all to
inhuman toil for the One whose Name cannot be expressed in the Basic
Multilingual Plane, he comes."

That's right, if you attempt to parse HTML with regular expressions,
you're succumbing to the temptations of the dark god Cthulhu's … er …
code."

http://www.codinghorror.com/blog/archives/001311.html

Read on for more detail, and the Right Way to do this.
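
As one possible sketch of a parser-based approach (my own illustration, not taken from the linked article), the following uses Python's standard html.parser module to re-emit a page while skipping every <div class="advertisement"> and everything nested inside it. The names AdStripper and strip_ads are hypothetical, and the code assumes the div tags inside each advertisement are properly balanced.

```python
from html.parser import HTMLParser


class AdStripper(HTMLParser):
    """Re-emit the input, dropping <div class="advertisement"> and its children."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an advertisement div

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside the advertisement
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "advertisement" in classes:
            self.skip_depth = 1  # start skipping
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1  # reaches 0 at the advertisement's own </div>
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")


def strip_ads(html):
    """Return html with all advertisement divs removed."""
    parser = AdStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

This is the "counting" idea from the question, except that a real tokenizer finds the tags, so attributes, comments and odd whitespace do not confuse it. To process the 100 files, one could read each file, pass its text to strip_ads(), and write the result back (after making a backup). For badly malformed pages, a dedicated HTML library would likely be more forgiving than this sketch.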

Celejar
-- 
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator

