Re: webcrawl to cache dynamic pages
> On Mon, May 02, 2005 at 01:27:41PM +0100, Richard Lyons wrote:
> > I am considering how to crawl a site which is dynamically generated,
> > and create a static version of all generated pages (or selected
> > generated pages). I guess it would be simplest to start with an
> > existing crawler, and bolt on some code. Or, alternatively, write a
> > script (perl, I fear) to modify the cache built by a crawler.
> >
> > The idea is to allow a static ecommerce site to be generated from any
> > database-generated shopping cart system.
> >
> > Any advice where to begin?
Well, I don't know of an "elegant" solution... one quick-and-dirty approach
would be to first download the site with "wget -r" (see the example below the
list). You would then end up with lots of files with names like these:
index.php?lang=es&tipo=obras&com=extracto
index.php?lang=es&tipo=obras&com=lista
index.php?lang=es&tipo=obras&com=susobras
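(Just for reference, the download step would be something along these lines;
the URL is of course only a placeholder for the real shop's address:)

  wget -r http://www.example.com/index.php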
So it would be quite easy to write a simple perl script that replaces those
special characters with more "static-looking" ones (sketched after the list),
and you would get something like:
index_lang-es_tipo-obras_com-extracto.html
index_lang-es_tipo-obras_com-lista.html
index_lang-es_tipo-obras_com-susobras.html
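An untested sketch of such a script (it assumes all the downloaded files sit
in the current directory and have names shaped like the ones above):

  #!/usr/bin/perl
  # rename.pl -- turn "index.php?a=b&c=d" into "index_a-b_c-d.html"
  use strict;
  use warnings;

  opendir(my $dh, '.') or die "opendir: $!";
  for my $old (readdir $dh) {
      next unless $old =~ /\?/;      # only touch files with a query string
      my $new = $old;
      $new =~ s/\.php\?/_/;          # "index.php?" -> "index_"
      $new =~ tr/=/-/;               # "=" -> "-"
      $new =~ tr/&/_/;               # "&" -> "_"
      $new .= '.html';
      rename $old, $new or warn "rename $old failed: $!\n";
  }
  closedir $dh;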
Also, you would surely have to parse the content of each downloaded file and
rewrite the links inside it so that they point at the new file names.
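Again only an untested sketch, with very simplistic regexes (note that wget
usually leaves "&" escaped as "&amp;" inside the HTML, so that case is handled
first):

  #!/usr/bin/perl
  # fixlinks.pl -- rewrite "index.php?..." links in every downloaded
  # page so that they point at the renamed static files.
  use strict;
  use warnings;

  for my $file (glob '*.html') {
      open my $in, '<', $file or die "$file: $!";
      my $html = do { local $/; <$in> };   # slurp the whole file
      close $in;

      $html =~ s{index\.php\?([^"'\s>]+)}{
          my $q = $1;
          $q =~ s/&amp;/_/g;               # "&amp;" -> "_"
          $q =~ tr/=/-/;                   # "="     -> "-"
          $q =~ tr/&/_/;                   # "&"     -> "_"
          "index_$q.html";
      }ge;

      open my $out, '>', $file or die "$file: $!";
      print $out $html;
      close $out;
  }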
Maybe too complicated?
Regards:
Nacho
--
No book comes out of a vacuum (G. Buehler)
http://www.lascartasdelavida.com