Re: webcrawl to cache dynamic pages
> On Mon, May 02, 2005 at 01:27:41PM +0100, Richard Lyons wrote:
> > I am considering how to crawl a site which is dynamically generated,
> > and create a static version of all generated pages (or selected
> > generated pages). I guess it would be simplest to start with an
> > existing crawler, and bolt on some code. Or, alternatively, write a
> > script (perl, I fear) to modify the cache built by a crawler.
> >
> > The idea is to allow a static ecommerce site to be generated from any
> > database-generated shopping cart system.
> >
> > Any advice where to begin?
Well, I don't know of an "elegant" solution... one quick-and-dirty approach
would be to first download the site with "wget -r" (see the example below the
list). You would then end up with lots of files with names like these:
index.php?lang=es&tipo=obras&com=extracto
index.php?lang=es&tipo=obras&com=lista
index.php?lang=es&tipo=obras&com=susobras
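(Just for reference, the download step would be something along these lines;
the URL is of course only a placeholder for the real shop's address:)

  wget -r http://www.example.com/index.php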
So it would be quite easy to write a simple perl script that replaces those
special characters with more "static-looking" ones (sketched after the list),
and you would get something like:
index_lang-es_tipo-obras_com-extracto.html
index_lang-es_tipo-obras_com-lista.html
index_lang-es_tipo-obras_com-susobras.html
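An untested sketch of such a script (it assumes all the downloaded files sit
in the current directory and have names shaped like the ones above):

  #!/usr/bin/perl
  # rename.pl -- turn "index.php?a=b&c=d" into "index_a-b_c-d.html"
  use strict;
  use warnings;

  opendir(my $dh, '.') or die "opendir: $!";
  for my $old (readdir $dh) {
      next unless $old =~ /\?/;      # only touch files with a query string
      my $new = $old;
      $new =~ s/\.php\?/_/;          # "index.php?" -> "index_"
      $new =~ tr/=/-/;               # "=" -> "-"
      $new =~ tr/&/_/;               # "&" -> "_"
      $new .= '.html';
      rename $old, $new or warn "rename $old failed: $!\n";
  }
  closedir $dh;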
Also, you would surely have to parse the content of each downloaded file and
rewrite the links inside it so that they point at the new file names.
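Again only an untested sketch, with very simplistic regexes (note that wget
usually leaves "&" escaped as "&amp;" inside the HTML, so that case is handled
first):

  #!/usr/bin/perl
  # fixlinks.pl -- rewrite "index.php?..." links in every downloaded
  # page so that they point at the renamed static files.
  use strict;
  use warnings;

  for my $file (glob '*.html') {
      open my $in, '<', $file or die "$file: $!";
      my $html = do { local $/; <$in> };   # slurp the whole file
      close $in;

      $html =~ s{index\.php\?([^"'\s>]+)}{
          my $q = $1;
          $q =~ s/&amp;/_/g;               # "&amp;" -> "_"
          $q =~ tr/=/-/;                   # "="     -> "-"
          $q =~ tr/&/_/;                   # "&"     -> "_"
          "index_$q.html";
      }ge;

      open my $out, '>', $file or die "$file: $!";
      print $out $html;
      close $out;
  }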
Maybe too complicated?
Regards:
Nacho
--
No book comes out of a vacuum (G. Buehler)
http://www.lascartasdelavida.com