[gopher] Changes to Veronica-2 (and VISHNU's present retirement)

To: gopher-project@lists.alioth.debian.org
Subject: [gopher] Changes to Veronica-2 (and VISHNU's present retirement)
From: Cameron Kaiser <spectre@floodgap.com>
Date: Tue, 22 Dec 2015 10:46:03 -0800 (PST)
Message-id: <201512221846.tBMIk3fE14286980@floodgap.com>
Reply-to: Gopher Project Discussion <gopher-project@lists.alioth.debian.org>

Veronica-2's internals have been substantially rewritten (again). Currently,
as the database now carries close to 4 million selectors -- most of which are,
to my delight, perfectly valid -- certain keyword sets cause big pulls on the
database and some queries will not return. For example, a relatively
innocuous search for "debian linux" will pull about 1.67 million selectors
that need to be evaluated and scored. This query completes and is highly
accurate, but not within the one-minute maximum timeout for queries sent
by outside clients. Some pathological ones I investigated took as long as
ten minutes. This problem will only get worse as Gopher slowly expands.

Google solves this by throwing hardware at the problem and sharding the heck
out of everything, but I can't afford to do anything much like that
(gopher.floodgap.com is a commercial-grade server with fast storage, but
its 2-way POWER6 CPU is showing its age, comparatively speaking). Although I
will probably unthrottle the CPU at some point and eat the additional power
usage cost, I wanted to see what I could wring out of it right now.

The current version now has a lot more predictive logic and even more
aggressive results-stage caching. If the predictor indicates that a query
is likely to go Cartesian, it then takes the most impactful keywords (as
determined by a tunable internal heuristic) and runs them against a second
cache that uses statistical sampling to pull a representative set, using
the more specific keywords' complete individual results for scoring purposes.
Since building this secondary cache is somewhat expensive, it does not do
so "live" (it takes about 30 minutes currently to analyze and generate the
extracts), but we're trying to rely on cached data more anyway, so this is
necessity turned into virtue.
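
The general shape of that approach can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not Veronica-2's actual
code: the index layout, the function names, the two tunables and the
"count matching keywords" scoring are all hypothetical stand-ins, since the
real predictor and heuristic are not described here. Broad keywords are served
from a precomputed random extract, while specific keywords keep their complete
result sets for scoring.

```python
import random

# Hypothetical tunables; the real heuristic and thresholds are not published.
BROAD_THRESHOLD = 1000  # a keyword matching more selectors than this is "broad"
SAMPLE_SIZE = 100       # size of the precomputed extract for a broad keyword


def build_sample_cache(index, seed=0):
    """Offline step: precompute a fixed-size random extract per broad keyword."""
    rng = random.Random(seed)
    cache = {}
    for keyword, selectors in index.items():
        if len(selectors) > BROAD_THRESHOLD:
            # sorted() gives random.sample the sequence it requires.
            cache[keyword] = set(rng.sample(sorted(selectors), SAMPLE_SIZE))
    return cache


def candidates(index, cache, keywords):
    """Map each keyword to its extract if broad, else to its complete results."""
    return {k: cache.get(k, index.get(k, set())) for k in keywords}


def search(index, cache, keywords):
    """Rank the candidate pool by how many query keywords each selector matches."""
    pulled = candidates(index, cache, keywords)
    pool = set().union(*pulled.values())
    # Membership tests against the full index are cheap; only the pool is bounded.
    scores = {s: sum(s in index.get(k, ()) for k in keywords) for s in pool}
    return sorted(scores, key=lambda s: (-scores[s], s))
```

With an index where "linux" matches thousands of selectors and "debian" only a
few dozen, search(index, cache, ["debian", "linux"]) scores at most
SAMPLE_SIZE plus the "debian" set, rather than every "linux" selector.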

tl;dr: Some queries will still be slow, but almost all should complete
within the one-minute timeout, some queries will now be substantially
faster, and the majority will still return useful and relevant results.
Please report all weird and unexpected behaviour.

As a consequence, though, VISHNU is now removed from the V-2 menu. Because
it requires accessing and pulling the entire dataset to be relevant, it is
not compatible with the statistical sampling technique V-2 is now using: the
most useful queries for VISHNU tend to be the least scalable. It really
just needs a full redesign from scratch and I'm not sure of the best
approach yet.

If you were using VISHNU previously, it will still respond to queries 
(assuming they work), though it is no longer publicly exposed on any Floodgap
menu. It was a nice idea at the time, but that's what experiments are for.

-- 
------------------------------------ personal: http://www.cameronkaiser.com/ --
  Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@floodgap.com
-- Feeling a little blue in January is normal. -- Marilu Henner ---------------
