
Re: webcrawl to cache dynamic pages



> On Mon, May 02, 2005 at 01:27:41PM +0100, Richard Lyons wrote:
> > I am considering how to crawl a site which is dynamically generated,
> > and create a static version of all generated pages (or selected
> > generated pages).  I guess it would be simplest to start with an
> > existing crawler, and bolt on some code. Or, alternatively, write a
> > script (perl, I fear) to modify the cache built by a crawler. 
> > 
> > The idea is to allow a static ecommerce site to be generated from any
> > database-generated shopping cart system.
> > 
> > Any advice where to begin?

Well, I don't know of an "elegant" solution... one quick-and-dirty approach
would be to first download the site with "wget -r". You would then end up
with lots of files with names like these:

index.php?lang=es&tipo=obras&com=extracto
index.php?lang=es&tipo=obras&com=lista
index.php?lang=es&tipo=obras&com=susobras
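
For reference, the wget step could be as simple as something like this
(example.com is just a placeholder for the real shop URL, and --no-parent
keeps the crawl inside the shop directory):

wget -r -np http://example.com/shop/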

It would then be quite easy to write a simple perl script that replaces the
special characters with more "static-like" ones, giving you something like:

index_lang-es_tipo-obras_com-extracto.html
index_lang-es_tipo-obras_com-lista.html
index_lang-es_tipo-obras_com-susobras.html
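
Just as an untested sketch (File::Find ships with perl; adjust the
substitutions to whatever characters your cart actually produces), the
renaming script could look something like this:

#!/usr/bin/perl
# Rough sketch: rename the files wget leaves behind, e.g.
#   index.php?lang=es&tipo=obras&com=extracto
# becomes
#   index_lang-es_tipo-obras_com-extracto.html
use strict;
use warnings;
use File::Find;

my $root = shift(@ARGV) || '.';     # directory created by "wget -r"

find(sub {
    return unless -f $_;
    return unless /\?/;             # only touch files with a query string
    my $new = $_;
    $new =~ s/\.php\?/_/;           # "index.php?lang=..." -> "index_lang=..."
    $new =~ s/=/-/g;                # "lang=es"            -> "lang-es"
    $new =~ s/&/_/g;                # "&tipo=obras"        -> "_tipo-obras"
    $new .= '.html';                # give the result a static extension
    rename $_, $new or warn "could not rename $_: $!\n";
}, $root);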

Also, you would surely have to parse the content of each file and rewrite
the links inside it so that they point at the renamed pages.
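
Again, only a rough and untested sketch of that second pass, assuming the
same naming convention as the renaming script above:

#!/usr/bin/perl
# Rough sketch: rewrite links like
#   href="index.php?lang=es&amp;tipo=obras&amp;com=lista"
# into
#   href="index_lang-es_tipo-obras_com-lista.html"
use strict;
use warnings;
use File::Find;

my $root = shift(@ARGV) || '.';

find(sub {
    return unless -f $_ && /\.html$/;   # the files renamed in the first pass
    local $/;                           # slurp whole files
    open my $in, '<', $_ or do { warn "cannot read $_: $!\n"; return };
    my $html = <$in>;
    close $in;

    # same transformation as the renaming script, applied to each link
    $html =~ s{(\w+)\.php\?([^"'\s>]+)}{
        my ($base, $query) = ($1, $2);
        $query =~ s/&amp;|&/_/g;        # "&amp;" inside HTML, "&" otherwise
        $query =~ s/=/-/g;
        "${base}_${query}.html";
    }ge;

    open my $out, '>', $_ or do { warn "cannot write $_: $!\n"; return };
    print {$out} $html;
    close $out;
}, $root);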

Maybe too complicated?

Regards:

Nacho

-- 
No book comes out of a vacuum (G. Buehler)
http://www.lascartasdelavida.com


