
Re: webcrawl to cache dynamic pages



If you end up wanting to do something more complicated, you could look
into WWW::Mechanize:

http://search.cpan.org/perldoc?WWW%3A%3AMechanize
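
Untested, but a minimal sketch of what a Mechanize-based crawl could look
like -- it fetches pages, saves each one under a flattened local filename,
and follows same-host links.  The start URL is a made-up placeholder you
would replace with the real site:

#!/usr/bin/perl
# Untested sketch: crawl one site with WWW::Mechanize and dump each HTML
# page to a local file.  The start URL below is a placeholder.
use strict;
use warnings;
use WWW::Mechanize;
use URI;

my $start = 'http://www.example.org/index.php';   # placeholder -- change me
my $host  = URI->new($start)->host;
my $mech  = WWW::Mechanize->new( autocheck => 0 );

my ( %seen, @queue );
push @queue, $start;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    $mech->get($url);
    next unless $mech->success && $mech->is_html;

    # Turn "/index.php?lang=es&tipo=obras" into a "static-looking" name.
    ( my $file = URI->new($url)->path_query ) =~ s{[/?&=]+}{_}g;
    $file =~ s/^_+//;
    $file .= '.html' unless $file =~ /\.html?$/;

    open my $fh, '>', $file or die "can't write $file: $!";
    print {$fh} $mech->content;
    close $fh;

    # Queue every link that stays on the same host.
    for my $link ( $mech->links ) {
        my $abs = $link->url_abs or next;
        next unless $abs->scheme && $abs->scheme =~ /^https?$/;
        push @queue, $abs->as_string if $abs->host eq $host;
    }
}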

David

On 09/05/05, Richard Lyons <richard@the-place.net> wrote:
> On Sun, May 08, 2005 at 09:48:07AM +0200, Nacho wrote:
> > > On Mon, May 02, 2005 at 01:27:41PM +0100, Richard Lyons wrote:
> > > > I am considering how to crawl a site which is dynamically generated,
> > > > and create a static version of all generated pages (or selected
> [...]
> >
> > Well, I don't know of an "elegant" solution... one quick-and-dirty approach
> > would be to first download the site with "wget -r"; you would then end up
> > with lots of files with names like these:
> >
> > index.php?lang=es&tipo=obras&com=extracto
> > index.php?lang=es&tipo=obras&com=lista
> > index.php?lang=es&tipo=obras&com=susobras
> >
> > It would then be quite easy to write a simple Perl script that replaces the
> > special characters with more "static-looking" ones, so you would end up with
> > something like:
> >
> > index_lang-es_tipo-obras_com-extracto.html
> > index_lang-es_tipo-obras_com-lista.html
> > index_lang-es_tipo-obras_com-susobras.html
> >
> > You would also surely have to parse the content of each file and rewrite
> > the links inside it to match the new names.
> >
> > Maybe too complicated?
> 
> Yes... that is the kind of thing I was imagining.  It will probably be
> quite simple once I get started.  But first I need to find time :-(
> 
> Thanks for the pointer.
> 
> --
> richard
> 
> 
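
For the quick wget -r plus rename approach Nacho describes above, a rough
and untested sketch might look something like the following.  The filename
rewrite assumes the index.php?lang=...&... pattern from his example, so
you would need to adjust it for anything else:

#!/usr/bin/perl
# Untested sketch of the "wget -r" + rename idea quoted above.  Run it
# inside the directory wget created; adapt the rewrite rules to taste.
use strict;
use warnings;
use File::Find;

my %rename;   # old basename => new "static-looking" basename

# Pass 1: rename every file whose name still contains a query string.
find( sub {
    return unless -f && /\?/;
    ( my $new = $_ ) =~ s/\.php\?/_/;    # index.php?lang=es -> index_lang=es
    $new =~ tr/=&/-_/;                   # lang=es&tipo=obras -> lang-es_tipo-obras
    $new .= '.html';
    $rename{$_} = $new;
    rename $_, $new or warn "rename $_: $!";
}, '.' );

# Pass 2: rewrite the old links inside every HTML file.  (Links written
# with &amp; instead of & would need an extra substitution.)
find( sub {
    return unless -f && /\.html?$/;
    open my $in, '<', $_ or return;
    my $html = do { local $/; <$in> };
    close $in;
    for my $old ( keys %rename ) {
        my $pat = quotemeta $old;
        $html =~ s/$pat/$rename{$old}/g;
    }
    open my $out, '>', $_ or return;
    print {$out} $html;
    close $out;
}, '.' );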


