
[gopher] Re: New Gopher Wayback Machine Bot

> > > Cameron, floodgap.com seems to have some sort of rate limiting and keeps
> > > giving me a Connection refused error after a certain number of documents
> > > have been spidered.
> > 
> > I'm a little concerned about your project since I do host a number of large
> > subparts which are actually proxied services, and I think even a gentle bot
> > going methodically through them would not be pleasant for the other side
> > (especially if you mean to regularly update your snapshot).
> Valid concern.  I had actually already marked your site off-limits
> because I noticed that.  Incidentally, your robots.txt doesn't seem to
> disallow anything -- might want to take a look at that ;-)

I know ;) it's because Veronica-2 won't harm the proxied services due to
the way it operates. However, I should be able to accommodate other bots that
may be around or come on board, so I'll rectify this.

> > I do support robots.txt, see
> > 
> > 	gopher.floodgap.com/0/v2/help/indexer
> Do you happen to have the source code for that available?  I've got
> some questions for you that it could explain (or you could), such as:
>  1. Which would you use?  (Do you expect URLs to be HTTP-escaped?)
>     Disallow: /Applications and Games
>     Disallow: /Applications%20and%20Games
> 2. Do you assume that all Disallow patterns begin with a slash as they
>    do in HTML, even if the Gopher selector doesn't?
> 3. Do you have any special code to handle the UMN case where
>    1/foo, /foo, and foo all refer to the same document?
> I will be adding robots.txt support to my bot and restarting it shortly.

It does not understand URL escaping; it matches literal selectors only. In the
case of #2/#3, well, maybe it would be better just to post the relevant code.
It should be relatively easy to understand (in Perl, from the V-2 iteration
library). $psr is the persistent state hash reference, and key "xcnd" contains
a list of selectors generated from Disallow: lines with User-agent: veronica
or *.

        # filter on exclusions
        my %excludes = %{ $psr->{"$host:$port"}->{"xcnd"} };
        my $key;
        foreach $key (sort { length($a) <=> length($b) } keys %excludes) {
                # excluded if the pattern matches the selector exactly,
                # matches it with a trailing slash added, or is a
                # slash-terminated prefix of the selector
                return (undef, undef, undef, undef, undef,
                                'excluded by robots.txt', 1)
                        if ($key eq $sel || $key eq "$sel/" ||
                                ($key =~ m#/$# &&
                                substr($sel, 0, length($key)) eq $key));
        }
As you can see from here, they would need to be specified separately, since
other servers might not treat them the same.
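For anyone wiring the same rules into a non-Perl bot, here's a minimal sketch in Python of what I believe the snippet above does (the function name is my own invention; it assumes the same semantics: patterns are literal selectors, never URL-unescaped, and a selector is excluded on an exact match, an exact match once a trailing slash is added, or a prefix match when the pattern itself ends in a slash):

```python
def is_excluded(selector, disallows):
    """Return True if a gopher selector is blocked by any of the
    literal Disallow patterns, mirroring the Perl logic above."""
    for pattern in sorted(disallows, key=len):
        # exact match, or pattern equals selector plus a trailing slash
        if pattern == selector or pattern == selector + "/":
            return True
        # slash-terminated patterns also exclude everything beneath them
        if pattern.endswith("/") and selector.startswith(pattern):
            return True
    return False
```

Note that under these semantics `Disallow: /Applications%20and%20Games` would not block the selector `/Applications and Games`, which is why both forms would need their own lines if a server happened to serve both.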

---------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@floodgap.com
-- An apple every eight hours will keep three doctors away. -------------------
