
id-utils (Was: Re: Search)



[moved it into debian-www where discussion belongs]
Greg McGary said:
> To give you a feel for performance, I ran mkid on a collection of
> source trees:
> 
> Number of files (C, C++, asm, some text) was 104564.
> Total size of indexed files was 2.60 GBytes.
> There were 477960 distinct tokens, and the average token occurred 215 times.
> High-water mark for memory consumed during indexing was 57 MB.
> The size of the output index file was 50.4 MB.
> The process took approx 14 minutes of CPU time and 30 minutes of real time.
So it took 30 minutes to index 2.6 GB of "stuff" and the resulting
index is 50 MB?  That is definitely impressive.  Most indexers would
take, say, 8-16 hours on that size, so this is more than an order of
magnitude faster, and the index is much smaller as well.

I still cannot get over its speed!  Why is it so fast? Are we missing
some important tokens?

I ran some tests on www.debian.org:
$ du -s /debian/web/debian.org/{intro,devel,ports,events,News}/
945     /debian/web/debian.org/intro
3304    /debian/web/debian.org/devel
1989    /debian/web/debian.org/ports
837     /debian/web/debian.org/events
4191    /debian/web/debian.org/News
$ time mkid -m myid.map -o dwww.id /debian/web/debian.org/{intro,devel,ports,events,News}/
[removing some errors about not being able to stat some files]
real    0m22.884s
user    0m10.020s
sys     0m0.430s

Not bad at all: 22 seconds to index about 11 MB.  The resulting index
file is about 1 MB, which is roughly 10% of the input size.
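To put the two runs side by side, here is a rough sketch of the
arithmetic.  The figures are copied from the numbers quoted in this
message; the assumptions are that du reports 1 KB blocks, that
1 GByte = 1024 MB, and that the web index is about 1 MB.

```python
# Figures taken from the two runs quoted in this message.
# Assumptions: du reports 1 KB blocks, 1 GByte = 1024 MB,
# and the www.debian.org index is "about 1 MB".
DU_BLOCKS_KB = [945, 3304, 1989, 837, 4191]

RUNS = {
    "source trees": {
        "input_mb": 2.60 * 1024,
        "index_mb": 50.4,
        "real_s": 30 * 60,
    },
    "www.debian.org": {
        "input_mb": sum(DU_BLOCKS_KB) / 1024,
        "index_mb": 1.0,
        "real_s": 22.884,
    },
}


def summarize(run):
    """Return (throughput in MB/s, index size as a percentage of input)."""
    throughput = run["input_mb"] / run["real_s"]
    index_pct = 100 * run["index_mb"] / run["input_mb"]
    return throughput, index_pct


if __name__ == "__main__":
    for name, run in RUNS.items():
        throughput, index_pct = summarize(run)
        print(f"{name}: {throughput:.2f} MB/s, index is {index_pct:.1f}% of input")
```

On these numbers the source-tree run comes out at roughly 1.5 MB/s
with an index near 2% of the input, while the web run is closer to
0.5 MB/s with an index near 9%.  The web sample is small, so I would
not read too much into the difference.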

> Anyway, that should give you an idea of mkid's performance
> characteristics.  If you like, try it yourself -- id-utils-3.2d is
> distributed with Debian, AFAIK.
I ran it across some directories I have here.  We still have some
way to go with id-utils, but it is definitely a very interesting idea.

The main problem is that it doesn't understand HTML pages, their
structure, or the different languages they are written in.  So
something in the body of a page gets the same weighting as something
in the title.
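To illustrate what I mean by weighting, here is a minimal sketch of an
HTML-aware token counter.  This is not how mkid works, and the names
FIELD_WEIGHTS, WeightedTokenizer and score_page, as well as the weight
of 5 for the title, are made up for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: the only requirement stated above is that a
# token in the title should count for more than one in the body.
FIELD_WEIGHTS = {"title": 5, "body": 1}


class WeightedTokenizer(HTMLParser):
    """Accumulate per-token scores, weighting <title> text more heavily."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Everything outside <title> is treated as body text, including
        # script and style content; a real indexer would skip those.
        weight = FIELD_WEIGHTS["title" if self.in_title else "body"]
        for token in re.findall(r"\w+", data.lower()):
            self.scores[token] += weight


def score_page(html):
    """Return a Counter mapping each token to its weighted score."""
    parser = WeightedTokenizer()
    parser.feed(html)
    parser.close()
    return parser.scores
```

With something like this, a page titled "Debian News" would rank
higher for "news" than a page that only mentions the word once in its
body text.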

But you've convinced me at least that it is on the right track.  If
you need help with some of the programming, let me know.

  - Craig
-- 
Craig Small VK2XLZ, PGP: AD 8D D8 63 6E BF C3 C7  47 41 B1 A2 1F 46 EC 90
Eye-Net Consulting http://www.eye-net.com.au/     <csmall@eye-net.com.au>
MIEEE <csmall@ieee.org>              Debian developer <csmall@debian.org>

