
Re: id-utils (Was: Re: Search)



csmall@scooter.eye-net.com.au (Craig Small) writes:

> > Number of files (C, C++, asm, some text) was 104564.
> > Total size of indexed files was 2.60 GBytes.
> > There were 477960 distinct tokens, and the average token occurred 215 times.
> > High-water mark for memory consumed during indexing was 57 MB.
> > The size of the output index file was 50.4 MB.
> > The process took approx 14 minutes of CPU time and 30 minutes of real time.
> 
> So it took 30 minutes to index 2.6 Gig of "stuff" and the resulting
> index is 50 Meg?  That is definitely impressive and is faster and
> smaller by several orders of magnitude (most indexers would take, say,
> 8-16 hours on that size).
> 
> I still cannot get over its speed!  Why is it so fast? Are we missing
> some important tokens?

It's fast because I worked hard to make it fast!  8^)
The lexer has a fast, simple inner loop.  The in-memory symbol table
uses double hashing with open addressing, and errs on the side of
sizing tables too large, so the collision rate is very low.
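
To make the symbol-table idea concrete, here is a minimal sketch of
double hashing with open addressing.  It is not taken from the id-utils
source: the names (`count_token', `hash1', `hash2', TABLE_SIZE), the
two hash functions, and the fixed table size are all assumptions chosen
for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size: a power of two, far larger than the expected
   number of distinct tokens, so probe sequences stay short. */
#define TABLE_SIZE 1024u

struct entry {
    char *token;          /* NULL marks an empty slot */
    unsigned long count;  /* occurrences seen so far */
};

static struct entry table[TABLE_SIZE];

/* Primary hash: picks the first slot to probe. */
static unsigned long hash1(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Secondary hash: picks the probe stride.  Forcing it odd makes it
   coprime with the power-of-two table size, so a probe sequence
   visits every slot before repeating. */
static unsigned long hash2(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = h * 131 + (unsigned char)*s++;
    return h | 1;
}

/* Record one occurrence of `token` and return its running count. */
unsigned long count_token(const char *token)
{
    unsigned long slot = hash1(token) & (TABLE_SIZE - 1);
    unsigned long step = hash2(token) & (TABLE_SIZE - 1);
    unsigned long probes;

    for (probes = 0; probes < TABLE_SIZE; probes++) {
        struct entry *e = &table[slot];
        if (e->token == NULL) {
            /* Empty slot: the token is new, so claim the slot. */
            e->token = malloc(strlen(token) + 1);
            assert(e->token != NULL);
            strcpy(e->token, token);
            return e->count = 1;
        }
        if (strcmp(e->token, token) == 0)
            return ++e->count;
        /* Collision: step to the next slot in this token's sequence. */
        slot = (slot + step) & (TABLE_SIZE - 1);
    }
    assert(!"table full; a real implementation would grow and rehash");
    return 0;
}
```

Because the stride depends on the token, two tokens that collide on
their first slot usually diverge on the next probe, and an oversized
table keeps most lookups to a single probe.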

> I ran some tests on www.debian.org
> $ du -s /debian/web/debian.org/{intro,devel,ports,events,News}/
> 945     /debian/web/debian.org/intro
> 3304    /debian/web/debian.org/devel
> 1989    /debian/web/debian.org/ports
> 837     /debian/web/debian.org/events
> 4191    /debian/web/debian.org/News
> $ time mkid -m myid.map -o dwww.id /debian/web/debian.org/{intro,devel,ports,events,News}/
> [removing some errors about not being able to stat some files]
> real    0m22.884s
> user    0m10.020s
> sys     0m0.430s

For fun, run `mkid -V' to see progress and get statistics at the end
of the run.  `mkid -s' will just give you the stats without the
progress.

> The main problem is it doesn't understand html pages and the context of
> them and different languages.  So something in the body of a page gets 
> the same weighting as something in the title.

Yes.  It needs a specific html scanner.  I'm hoping there's a simple
way to fiddle with the locale at runtime to influence the behavior of
ctype, so that the scanner can be written in a language-independent
fashion in terms of isalpha/isdigit.  Beyond that, there needs to be
some special handling for some html directives such as "<meta ...>" to
extract keywords.
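
As a rough sketch of the locale idea, the scanner below selects the
LC_CTYPE locale at runtime with setlocale() and then classifies
characters only through isalpha/isdigit.  The function name
`count_tokens' and its interface are assumptions made for illustration,
not existing id-utils code.  It handles single-byte encodings only; a
multibyte encoding would need the wide-character equivalents instead.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Count the tokens (runs of letters and digits) in `text', classifying
   characters according to the LC_CTYPE locale named by `locale_name'.
   Returns -1 if the C library cannot select that locale. */
long count_tokens(const char *text, const char *locale_name)
{
    const unsigned char *p;
    long tokens = 0;

    assert(text != NULL && locale_name != NULL);

    /* Switch character classification at runtime. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* ctype functions require values in the unsigned char range. */
    p = (const unsigned char *)text;

    while (*p) {
        /* Skip separators: anything that is neither letter nor digit. */
        while (*p && !isalpha(*p) && !isdigit(*p))
            p++;
        if (*p == '\0')
            break;

        /* Consume one token. */
        tokens++;
        while (isalpha(*p) || isdigit(*p))
            p++;
    }
    return tokens;
}
```

For example, count_tokens("fast, simple inner loop", "C") sees four
tokens.  Which locale names other than "C" are available depends on the
system, so a caller should check for the -1 return.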

> But you've convinced me at least that it is on the right track, if you
> need help with some of the programming let me know.

I definitely could use some help!  First, you need to assign copyright
to the FSF for id-utils:
http://www.gnu.org/software/gcc/fsf-forms/assignment-instructions.html

Can you work on a scanner for HTML & language-specific text?  I think
that would be the best project for someone other than me to do.

Greg

