
Re: Search engine for documentation indexing?



Olly> Automatically uncompressing gzipped files for indexing isn't hard
Olly> to do, but what can you link to for them in the search results?
Olly> Of the four web browsers I just tried, only w3m showed the
Olly> contents of file:///usr/share/doc/coreutils/README.gz rather than
Olly> downloading it for me.  Same for
Olly> http://localhost/doc/coreutils/README.gz it seems.

The dwww cgi uncompresses these files for you, so something like

http://localhost/cgi-bin/dwww/usr/share/doc/foo/README.gz

works (in any browser).

Also, it may be possible to use mod_deflate in Apache to transparently
uncompress, though I have never tried that.
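
For what it's worth, the trick I've seen suggested (untested by me, and
it actually uses mod_mime rather than mod_deflate) is to tag .gz files
with a Content-Encoding so the browser itself does the uncompressing:

# Untested sketch: serve *.gz under /usr/share/doc with
# Content-Encoding gzip so browsers decode them on the fly.
<Directory /usr/share/doc>
    AddEncoding gzip .gz
</Directory>

Whether every browser then displays the result instead of offering to
download it is another question.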

Olly> But as others have said, recoll is probably a better choice for a
Olly> Xapian-based solution for a desktop situation anyway.

Well, I'm not running a typical desktop, at least not if that means a
GNOME or KDE stack.  The fact that recoll is tied to a particular GUI is
definitely a disadvantage.  So no orphaning omega, please! :-)

What I've ended up doing is a hack: I build a separate tree that is
mostly a symlink farm pointing into /usr/share/doc, except that gzipped
files are replaced by their uncompressed versions (a rough sketch of the
tree-building step is below).  Then I run omindex on the new tree.  I've
just tested this and it does the job.  Indexing time is about 16 min,
which falls between swish-e (9 min) and swish++ (27 min).  Not terrible,
but maybe there is a way to speed it up through parallelization?  The
omega docs seem to say nothing about concurrent access to the index.  Is
it possible to run two indexer processes at once, each updating the same
index but with different files?
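
For reference, the tree-building step is roughly this (a simplified
Python sketch, not my exact script; the destination path is just an
example):

#!/usr/bin/env python3
# Mirror /usr/share/doc into a parallel tree: symlink ordinary files,
# store uncompressed copies of *.gz files so omindex can read them.
import gzip
import os
import shutil

SRC = "/usr/share/doc"
DST = os.path.expanduser("~/doc-index-tree")   # illustrative path

for dirpath, dirnames, filenames in os.walk(SRC):
    outdir = os.path.join(DST, os.path.relpath(dirpath, SRC))
    os.makedirs(outdir, exist_ok=True)
    for name in filenames:
        src = os.path.join(dirpath, name)
        if name.endswith(".gz"):
            # drop the .gz suffix and write the uncompressed contents
            dst = os.path.join(outdir, name[:-3])
            try:
                with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                    shutil.copyfileobj(fin, fout)
            except (OSError, EOFError):
                pass   # unreadable or bogus .gz; just skip it
        else:
            dst = os.path.join(outdir, name)
            if not os.path.lexists(dst):
                os.symlink(src, dst)

and then something like

omindex --db /path/to/db --url /doc ~/doc-index-tree

on the result.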

-- 
Ian Zimmerman <itz@buug.org>
gpg public key: 1024D/C6FF61AD 
fingerprint: 66DC D68F 5C1B 4D71 2EE5  BD03 8A00 786C C6FF 61AD
Ham is for reading, not for eating.

