[gopher] Re: parallelizing Veronica-2
JumpJet has NO objection to any spider running multiple threads on the server.  Also, PLEASE index as much as you can (EVERY resource if possible), and do it on a regular basis (things change often).
--- On Wed, 7/23/08, chris <chris@hal3000.cx> wrote:
From: chris <chris@hal3000.cx>
Subject: [gopher] Re: parallelizing Veronica-2
To: gopher@complete.org
Date: Wednesday, July 23, 2008, 1:07 PM
I currently use 6 threads on my spider/crawler with no one complaining,
although they run independently of each other against different sites until
those sites are exhausted. When it comes down to the last few sites, the
spiders work the same site together or in groups. My Veronica indexes in a
single thread as well.
I don't think you're going to be a burden on any servers.
Oh, and my selector rest time is 3 seconds between hits.
Chris
On Tue, 22 Jul 2008 22:09:32 -0700 (PDT)
Cameron Kaiser <spectre@floodgap.com> wrote:
> One big problem with Veronica-2 is that the crawling process is presently
> quite slow (besides the manual data massaging that I have to do every so
> often, review the dead hosts list, prune pathologic selectors, etc.) and this
> impacts the accuracy of the search database because it is nowhere near as
> sprightly as Google.
> 
> V-2 will never quite be the Google of Gopherspace but there are some
> optimizations I have in mind for increasing its coverage and therefore
> relevance. Some of these I'm implementing now.
> 
> However, the biggest change I am considering is parallelizing it (rather
> than paralysing it ;-). Right now there is a single thread running doing
> the crawling, which is somewhat inefficient, but done this way to make
> debugging easier. As it stands, there have been no major changes to the
> crawling core for almost two years -- I have been making various changes
> to the search client end, but not to the actual crawler.
> 
> For this reason, I'd like to increase the number of crawl threads from one
> to three as a test case to see how well this operates. This won't improve
> throughput 3x, because my profiling shows a fair bit of the load is database
> writes, but it will improve it by a non-trivial factor. However, there is
> also the possibility that people will see parallel hits to their server if
> the crawlers have a small set of servers to iterate over at any given time.
> To reduce the possibility of a loop causing the crawl threads to hammer
> individual hosts, each individual thread can at most hit a selector every
> five seconds even if it is switching to a different host just in case the
> interthread communication glitches. This should keep load down while crawling
> is in progress as there will always be a hard rate limit.
> 
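As a rough illustration of the hard per-thread limit described above, here is
a minimal Python sketch. It is not V-2's actual code: the names
SelectorThrottle, crawl_worker and fetch are made up for the example, and only
the five-second figure comes from the message. The idea is that each crawl
thread owns its own throttle, so the limit holds even if coordination between
threads fails.

```python
import time


class SelectorThrottle:
    """Enforce a minimum gap between successive requests made by one crawl thread.

    Hypothetical helper for illustration; create one instance per thread.
    """

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits (5 per the message)
        self._clock = clock
        self._sleep = sleep
        self._last_hit = None  # time of this thread's previous request

    def wait(self):
        """Block until min_interval has passed since the last hit; return the delay."""
        delay = 0.0
        if self._last_hit is not None:
            elapsed = self._clock() - self._last_hit
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last_hit = self._clock()
        return delay


def crawl_worker(selectors, fetch, throttle):
    """Fetch each (host, selector) pair, pausing via the throttle before each one."""
    for host, selector in selectors:
        throttle.wait()  # applies even when the host differs from the previous one
        fetch(host, selector)
```

Because the throttle ignores which host is being requested, a thread that
switches hosts still waits out the full interval, which is what makes the
limit "hard" in this sketch.
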
> People who have observed the crawler in operation will also note that it
> does not request every single resource anyway, since it doesn't index them;
> it looks at menus primarily, and only individual resources if they were
> linked in from somewhere else to verify their existence. This goes a long way
> to making V-2 a better neighbour, I think, and I did this by design.
> 
> Please let me know if there is any strenuous opposition to increasing the
> crawl rate. This will not go into effect probably for a few weeks while I
> internally debug the synchronization code, but it may be in operation by
> this fall.
> 
> -- 
> ------------------------------------ personal: http://www.cameronkaiser.com/ --
>   Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@floodgap.com
> -- BOND THEME NOW PLAYING: "Diamonds Are Forever" -----------------------------
> 
> 
> 
-- 
FreeBSD it's Da Bomb
      