Re: id-utils (Was: Re: Search)
csmall@scooter.eye-net.com.au (Craig Small) writes:
> > Number of files (C, C++, asm, some text) was 104564.
> > Total size of indexed files was 2.60 GBytes.
> > There were 477960 distinct tokens, and the average token occurred 215 times.
> > High-water mark for memory consumed during indexing was 57 MB.
> > The size of the output index file was 50.4 MB.
> > The process took approx 14 minutes of CPU time and 30 minutes of real time.
>
> So it took 30 minutes to index 2.6 Gig of "stuff" and the resulting
> index is 50 Meg? That is definitely impressive and is faster and
> smaller by several orders of magnitude (most indexers would take, say,
> 8-16 hours on that size).
>
> I still cannot get over its speed! Why is it so fast? Are we missing
> some important tokens?
It's fast because I worked hard to make it fast! 8^)
The lexer has a fast, simple inner loop. The in-memory symbol table
uses double hashing with open addressing, and it errs on the side of
sizing the tables too large, so the collision rate stays very low.
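To illustrate (this is just a sketch of the general technique, not the
actual id-utils code): with double hashing, a second hash determines the
probe step, and keeping the power-of-two table oversized keeps probe
chains short. The names here (intern, hash1, hash2) are made up for the
example.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

/* Deliberately oversized power-of-two table: low load factor
   means few collisions, which is the point of the design. */
#define TABLE_SIZE 1024

static const char *table[TABLE_SIZE];

static uint32_t hash1(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static uint32_t hash2(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = h * 131 + (unsigned char)*s++;
    return h | 1;   /* force odd */
}

/* Intern a token: return the canonical stored pointer,
   inserting the token on first sight. */
const char *intern(const char *token)
{
    uint32_t i = hash1(token) & (TABLE_SIZE - 1);
    /* An odd step is coprime to the power-of-two table size,
       so the probe sequence visits every slot. */
    uint32_t step = hash2(token) & (TABLE_SIZE - 1);

    while (table[i] != NULL) {
        if (strcmp(table[i], token) == 0)
            return table[i];        /* already present */
        i = (i + step) & (TABLE_SIZE - 1);  /* double-hash probe */
    }
    table[i] = token;
    return token;
}
```

Each distinct token ends up stored once, and lookups for repeated
tokens (the common case, given 215 occurrences per token on average)
usually terminate on the first probe.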
> I ran some tests on www.debian.org
> $ du -s /debian/web/debian.org/{intro,devel,ports,events,News}/
> 945 /debian/web/debian.org/intro
> 3304 /debian/web/debian.org/devel
> 1989 /debian/web/debian.org/ports
> 837 /debian/web/debian.org/events
> 4191 /debian/web/debian.org/News
> $ time mkid -m myid.map -o dwww.id /debian/web/debian.org/{intro,devel,ports,events,News}/
> [removing some errors about not being able to stat some files]
> real 0m22.884s
> user 0m10.020s
> sys 0m0.430s
For fun, run `mkid -V' to see progress and get statistics at the end
of the run. `mkid -s' will just give you the stats without the
progress.
> The main problem is it doesn't understand html pages and the context of
> them and different languages. So something in the body of a page gets
> the same weighting as something in the title.
Yes. It needs a dedicated HTML scanner. I'm hoping there's a simple
way to fiddle with the locale at run time to influence the behavior of
ctype, so that the scanner can be written in a language-independent
fashion in terms of isalpha/isdigit. Beyond that, there needs to be
special handling for certain HTML directives, such as "<meta ...>", to
extract keywords.
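The locale idea above might look something like this sketch (my own
illustration, not existing id-utils code; the function names are
hypothetical): select the user's LC_CTYPE locale once at startup, and
the same classification test then accepts whatever letters that locale
defines, e.g. ISO-8859-1 accented characters.

```c
#include <ctype.h>
#include <locale.h>

/* Adopt the locale from the environment (LANG/LC_CTYPE) so that
   isalpha() reflects the user's character set rather than plain "C". */
void scanner_init(void)
{
    setlocale(LC_CTYPE, "");
}

/* Language-independent "is this a token character" test, written
   purely in terms of ctype as suggested above. The cast to unsigned
   char avoids undefined behavior for bytes >= 0x80. */
int is_token_char(int c)
{
    unsigned char uc = (unsigned char)c;
    return isalpha(uc) || isdigit(uc) || uc == '_';
}
```

The scanner's inner loop can then stay a single byte-at-a-time test,
with all language variation pushed into the locale tables.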
> But you've convinced me at least that it is on the right track, if you
> need help with some of the programming let me know.
I definitely could use some help! First, you need to assign copyright
to the FSF for id-utils:
http://www.gnu.org/software/gcc/fsf-forms/assignment-instructions.html
Can you work on a scanner for HTML & language-specific text? I think
that would be the best project for someone other than me to do.
Greg