
lists.debian.org archives and google indexing

(please CC: me any replies, as I am not subscribed to debian-www)


This email resulted from a request from someone on -devel/-user (I forget
which) and a brief discussion in #debian-devel.

Would it be possible to open parts of the list archives on lists.debian.org
to *selected* indexing bots? The Glimpse search engine that lists.debian.org
offers is sometimes limited in what it can do...

I propose that (at least) indexing bots from major public search engines
with a very good track record (in matters such as public behaviour, crawler
behaviour, and relevance to GNU/Linux and GNU/Hurd users, given the content
of their searchable indexes) be allowed to index the *public* list archives.
The ability to narrow a search down to a given domain (so that users can
restrict a query to *.debian.org) might be required as well, if there is a
perceived need for it.

(on the other hand, if bandwidth is not an issue, all indexer robots might
be allowed -- but I assume the bots were blocked in the first place to save
on bandwidth)

Of course, only search engines which are actually *requested* by someone
need to be considered for inclusion (btw, please consider google (crawler
robot id "Googlebot") requested).

Selective permission based on robot id is supported by the robots.txt file
itself, or alternatively (should there be a reason not to trust the
robots.txt system) by an apache mod_rewrite rule that serves a different
robots.txt file depending on the requesting user agent ("crawler").
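To sketch what I mean, here is a minimal robots.txt that admits only
Googlebot (an empty Disallow line means "nothing is disallowed" for that
user agent, per the robots exclusion convention) while keeping all other
crawlers out:

```
# Allow Googlebot to index everything.
User-agent: Googlebot
Disallow:

# Deny all other crawlers.
User-agent: *
Disallow: /
```

And, for the mod_rewrite alternative, something along these lines (the
file name robots-google.txt is just an illustration, not an existing file):

```
# Serve a permissive robots.txt to Googlebot only, keyed on the
# User-Agent request header; everyone else gets the default file.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule ^/robots\.txt$ /robots-google.txt [L]
```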

BTW, I am in no way affiliated with google, I just happen to like their
search engine.

  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
