Re: Search engine for documentation indexing?
Olly> Automatically uncompressing gzipped files for indexing isn't hard
Olly> to do, but what can you link to for them in the search results?
Olly> Of the four web browsers I just tried, only w3m showed the
Olly> contents of file:///usr/share/doc/coreutils/README.gz rather than
Olly> downloading it for me. Same for
Olly> http://localhost/doc/coreutils/README.gz it seems.
The dwww CGI uncompresses these files for you, so something like
http://localhost/cgi-bin/dwww/usr/share/doc/foo/README.gz
works (in any browser).
Also, it may be possible to use mod_deflate in Apache to uncompress
them transparently, though I have never tried that.
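If anyone wants to experiment, one untested variant doesn't even need
mod_deflate: label the .gz files with Content-Encoding so the browser
inflates them itself, with something like this in the Apache config:

    # Untested sketch: serve pre-gzipped doc files with
    # Content-Encoding: gzip so the browser decompresses them,
    # and force a text type so they are displayed, not downloaded.
    <Directory /usr/share/doc>
        AddEncoding gzip .gz
        <FilesMatch "\.gz$">
            ForceType text/plain
        </FilesMatch>
    </Directory>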
Olly> But as others have said, recoll is probably a better choice for a
Olly> Xapian-based solution for a desktop situation anyway.
Well, but I'm not running a typical desktop, at least not if that means
a Gnome or KDE stack. The fact that recoll is bound to a particular GUI
is definitely a disadvantage. So no orphaning omega please! :-)
What I'm ending up doing is a hack: I build a separate tree that is
mostly a symlink farm pointing to /usr/share/doc, except that gzipped
files are replaced by their uncompressed versions. Then I run omindex
on the new tree (a rough sketch follows below). I've just tested this
and it does the job. Indexing time is about 16 min, which falls between
swish-e (9 min) and swish++ (27 min). Not terrible, but maybe there is
a way to speed it up through parallelization? The omega docs seem to
say nothing about concurrent access to the index. Is it possible to run
two indexer processes at once, each updating the same index but with
different files?
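In case it's useful to anyone, the shadow tree is built along these
lines (a rough Python sketch; the ~/doc-shadow location and the
omindex arguments are just examples):

    import gzip
    import os
    import shutil

    SRC = "/usr/share/doc"
    DST = os.path.expanduser("~/doc-shadow")   # example location

    for dirpath, dirnames, filenames in os.walk(SRC):
        outdir = os.path.join(DST, os.path.relpath(dirpath, SRC))
        os.makedirs(outdir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            if name.endswith(".gz"):
                # replace foo.gz with an uncompressed copy named foo
                dst = os.path.join(outdir, name[:-3])
                try:
                    with gzip.open(src, "rb") as fin, \
                         open(dst, "wb") as fout:
                        shutil.copyfileobj(fin, fout)
                except OSError:
                    pass  # unreadable or not really gzip; skip it
            else:
                # everything else is just a symlink back into the tree
                dst = os.path.join(outdir, name)
                if not os.path.lexists(dst):
                    os.symlink(src, dst)

    # afterwards, index the shadow tree instead of /usr/share/doc, e.g.
    #   omindex --db /path/to/xapian/db --url /doc ~/doc-shadow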
--
Ian Zimmerman <itz@buug.org>
gpg public key: 1024D/C6FF61AD
fingerprint: 66DC D68F 5C1B 4D71 2EE5 BD03 8A00 786C C6FF 61AD
Ham is for reading, not for eating.