
Re: lists.debian.org archives and google indexing

On Thu, Dec 14, 2000 at 07:20:47PM -0200, Henrique M Holschuh wrote:
> (please CC: me any replies, as I am not subscribed to debian-www)
> Hello,
> This email resulted from a request from someone on -devel/-user (I forget
> which) and a small talk in #debian-devel.

That someone may have been me. After checking the robots.txt I asked here
(debian-www). I quote my question and the answer again because you are not
on the list.

On Thu, Dec 14, 2000 at 01:30:14PM +0100, Thomas Guettler wrote:
> Why do you disallow /Lists-Archives/?  I used to search the archive
> with google. I preferred this to http://lists.debian.org/search.html
> because google searches in all lists and all archives by default.
> (site:debian.org)

>Because of severe problems with robots (spiders) -- googlebot brought the
>whole machine down last time. I'm still wary of allowing robots access to 
>that, even though we have robots.txt for BTS pages now...

So Googlebot caused too much traffic for the machine? Maybe we could
ask Google to scan the site at a reduced rate (a few pages per second,
or something like that). It is no good if the bot behaves like a
DoS attack.

BTW, there is so much email traffic in Debian (as in the whole
open source sector) that I think mailing lists are not the best way
to communicate.

I leave the rest of your message quoted here in case someone wants to
refer to it later.
> Would it be possible to open parts of the list archives in lists.debian.org
> to *selected* indexing bots? The glimpse search engine lists.debian.org
> offers is sometimes limited in what it can do...
> I propose that (at least) indexing bots from major public searchable
> directories with a very good track record (in matters such as public
> behaviour, crawler behaviour and relevance to GNU/Linux and GNU/Hurd users
> due to the content of their searchable indexes) be allowed to index the
> *public* list archives.  The ability to narrow down searches to a
> given domain (so that we can ask it to only search *debian.org) might be
> required as well, if there is a perceived need for it.
> (on the other hand, if bandwidth is not an issue, all indexer robots might
> be allowed -- but I assume the bots were blocked in the first place to save
> on bandwidth)
> Of course, only search engines which are actually *requested* by someone
> need to be considered for inclusion (btw, please consider google (crawler
> robot id "Googlebot") requested).
> Selective permission based on robot id is supported by the robots.txt
> file, or alternatively (should there be a need for not trusting the
> robots.txt system for some reason) by an Apache mod_rewrite rule which feeds
> different robots.txt files depending on the requesting browser ("crawler").
> BTW, I am in no way affiliated with google, I just happen to like their
> search engine.
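
To make the robots.txt option from the quote above concrete, here is a
minimal sketch. The rules are only an assumption of what such a file could
look like, not the real lists.debian.org robots.txt, and "SomeOtherBot"
and the allowed() helper are made-up names. It uses Python's standard
urllib.robotparser to check which robot id may fetch the archives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything, every other
# robot is kept out of /Lists-Archives/. An empty "Disallow:" line means
# "nothing is disallowed" for that user agent.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /Lists-Archives/
"""


def allowed(agent: str, path: str) -> bool:
    """Return True if the given robot id may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, allowed(agent, "/Lists-Archives/debian-www/"))
```

This only works for robots that actually honour robots.txt; a misbehaving
crawler would still need the mod_rewrite approach or some other server-side
block.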

Thomas Guettler
Office: <guettli@interface-business.de> www.interface-business.de
Private:<guettli@gmx.de>  http://yi.org/guettli
