
Re: Search engine for documentation indexing?



Olly> Automatically uncompressing gzipped files for indexing isn't hard
Olly> to do, but what can you link to for them in the search results?
Olly> Of the four web browsers I just tried, only w3m showed the
Olly> contents of file:///usr/share/doc/coreutils/README.gz rather than
Olly> downloading it for me.  Same for
Olly> http://localhost/doc/coreutils/README.gz it seems.

The dwww cgi uncompresses these files for you, so something like

http://localhost/cgi-bin/dwww/usr/share/doc/foo/README.gz

works (in any browser).

Also, it may be possible to use mod_deflate in Apache to transparently
uncompress, though I have never tried that.
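
For what it's worth, the trick I've seen suggested (untested by me, and
it actually uses mod_mime rather than mod_deflate) is to tag .gz files
with a Content-Encoding so the browser itself does the uncompressing:

# Untested sketch: serve *.gz under /usr/share/doc with
# Content-Encoding gzip so browsers decode them on the fly.
<Directory /usr/share/doc>
    AddEncoding gzip .gz
</Directory>

Whether every browser then displays the result instead of offering to
download it is another question.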

Olly> But as others have said, recoll is probably a better choice for a
Olly> Xapian-based solution for a desktop situation anyway.

Well, I'm not running a typical desktop, at least not if that means a
GNOME or KDE stack.  The fact that recoll is tied to a particular GUI is
definitely a disadvantage.  So no orphaning omega, please! :-)

What I've ended up doing is a hack: I build a separate tree that is
mostly a symlink farm pointing into /usr/share/doc, except that gzipped
files are replaced by their uncompressed versions (a rough sketch of the
tree-building step is below).  Then I run omindex on the new tree.  I've
just tested this and it does the job.  Indexing time is about 16 min,
which falls between swish-e (9 min) and swish++ (27 min).  Not terrible,
but maybe there is a way to speed it up through parallelization?  The
omega docs seem to say nothing about concurrent access to the index.  Is
it possible to run two indexer processes at once, each updating the same
index but with different files?
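
For reference, the tree-building step is roughly this (a simplified
Python sketch, not my exact script; the destination path is just an
example):

#!/usr/bin/env python3
# Mirror /usr/share/doc into a parallel tree: symlink ordinary files,
# store uncompressed copies of *.gz files so omindex can read them.
import gzip
import os
import shutil

SRC = "/usr/share/doc"
DST = os.path.expanduser("~/doc-index-tree")   # illustrative path

for dirpath, dirnames, filenames in os.walk(SRC):
    outdir = os.path.join(DST, os.path.relpath(dirpath, SRC))
    os.makedirs(outdir, exist_ok=True)
    for name in filenames:
        src = os.path.join(dirpath, name)
        if name.endswith(".gz"):
            # drop the .gz suffix and write the uncompressed contents
            dst = os.path.join(outdir, name[:-3])
            try:
                with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                    shutil.copyfileobj(fin, fout)
            except (OSError, EOFError):
                pass   # unreadable or bogus .gz; just skip it
        else:
            dst = os.path.join(outdir, name)
            if not os.path.lexists(dst):
                os.symlink(src, dst)

and then something like

omindex --db /path/to/db --url /doc ~/doc-index-tree

on the result.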

-- 
Ian Zimmerman <itz@buug.org>
gpg public key: 1024D/C6FF61AD 
fingerprint: 66DC D68F 5C1B 4D71 2EE5  BD03 8A00 786C C6FF 61AD
Ham is for reading, not for eating.

