Re: Ask Jeeves Crawler access to Debian
On Fri, Jun 18, 2004 at 05:31:32PM +0100, MJ Ray wrote:
> On 2004-06-18 15:09:30 +0100 Kaushal Kurapati
> <KKurapati@askjeeves.com> wrote:
> >On bugs.debian.org, we notice that there is a "disallow" directive in
> >your robots.txt that blocks our crawler from accessing pages on your
> I cannot speak for Debian, but I suspect this is because generating
> the HTML version of the bugs site needs more CPU power than they are
> willing to give search engines for free.
Speaking as one of the Debian bug tracking system administrators,
although perhaps not for all of them:
Since almost all of bugs.debian.org is dynamically generated and very
densely hyperlinked (often to several representations of the same data),
crawlers tend to sit there for days on end wandering through it for
relatively little gain. It's not uncommon for one of them to get totally
lost in the list of bugs indexed by submitter, which really isn't
relevant to them, and for one of us to come along some time later and
wonder why somebody's making ten thousand extremely similar queries in
sequence that take a few seconds each.
That sort of thing is why the robots.txt entry is there.
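
For illustration only, here is a minimal sketch of how a well-behaved
crawler might check such an entry before fetching a page. The robots.txt
contents, the host name, the paths and the "ExampleBot" user agent are all
made up for the example; they are not the actual file or URLs served by
bugs.debian.org.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for this sketch, not the
# real file): keep every crawler out of the dynamically generated pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs: one dynamic per-submitter listing, one static page.
    dynamic = "http://bugs.example.org/cgi-bin/pkgreport.cgi?submitter=a@example.org"
    static = "http://bugs.example.org/index.html"
    print(is_allowed(ROBOTS_TXT, "ExampleBot", dynamic))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", static))
```

A crawler that runs a check like this before each request never starts the
walk through the per-submitter listings described above.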
> Maybe if you were to make some suitably large donation to cover the
> cost of adding that power, people would reconsider.
It's mostly an effort/usefulness trade-off.
Colin Watson [email@example.com]