
Re: Search



On Mon, Mar 20, 2000 at 10:54:21PM -0700, Greg McGary wrote:
> "James A. Treacy" <treacy@debian.org> writes:
> 
> > Is it really necessary to know the charset used on the page? As long
> > as searches are 8 bit clean I would think that it wouldn't make a
> > difference.
> 
> The issue is how does one delimit tokens?  You need to know the
> character-classes in order to know when you have transitioned from one
> character class to another, and therefore ought to end the current
> token.  You then need to know which sequences of character classes to
> keep and which to toss (keep "words", toss sequences of whitespace and
> punctuation).  I suppose if non-word char classes (e.g., whitespace,
> punctuation) are consistent across all languages and charsets, then
> you can treat everything that's not a non-word as a word and be done
> with.  I don't know enough about the subject to judge.
> 
Obviously, neither do I. :) All Americans should be taken out back and
forced to learn another language (actually, I understand 3 languages
poorly -- 4 if you include English).
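
To make the question concrete, here is roughly what "treat everything that
is not whitespace or punctuation as a word" looks like. This is a toy Python
sketch, not how mkid actually scans anything:

  import re

  # Toy tokenizer: a token is a maximal run of "word" characters, and the
  # whitespace and punctuation in between is thrown away.  \w already knows
  # about non-ASCII letters, which is exactly the part that depends on
  # knowing the charset of the page being indexed.
  def tokens(text):
      return re.findall(r"\w+", text)

  print(tokens("Hello, world -- ceci n'est pas une pipe."))
  # ['Hello', 'world', 'ceci', 'n', 'est', 'pas', 'une', 'pipe']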

> > Unless you are interested in creating a general purpose cgi frontend
> > it is probably better if you work on the searching/indexing and
> > specialized parsers while we create the cgi interface.
> 
> That's fine by me.  The less I have to do the better. 8^) I'm sure I
> can mold the id-utils query interface to be whatever you like.
> 
Writing a general front end is tricky (there are a lot of details to
consider), but specific ones are easy. The only part that is a pain
is when you want to chop up results into multiple pages.
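
The pagination bit itself boils down to something like the Python below (the
parameter names are made up -- a real script would read them from the query
string). The annoying part is that the start value has to be carried from
page to page in the URL:

  # Slice one page out of the full result list and note whether there are
  # neighbouring pages to link to.
  def page_of(results, start=0, count=20):
      chunk = results[start:start + count]
      has_prev = start > 0
      has_next = start + count < len(results)
      return chunk, has_prev, has_next

  hits = ["msg%04d" % n for n in range(57)]
  chunk, has_prev, has_next = page_of(hits, start=40)
  print(len(chunk), has_prev, has_next)   # 17 True False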

> > I'd love to help, but am already
> > overextended. :(
> 
> I think Craig is going to give that a go.  We've been discussing it
> offline already.  If you can handle the cgi frontend, that's plenty
> useful.
> 
Great.

> > With respect to parsers, do you have a suggestion on the best way to
> > handle the list archives? We currently generate a single file for each
> > list for each month (in standard mail format - some as big as 10MB).
> > Each file is then broken up into a directory containing one htmlized
> > file for each piece of mail. This generates a LOT of files. Do you
> > think it would be practical (from a speed point of view) to work
> > directly from the big files and extract the relevant mails on the fly?
> 
> Don't queries need to return the html file names, since that's what
> the users will see?
No. The query is sent through a CGI script and we can simply let the
results be returned through that URL. We can also choose whether the
script's input shows up in the URL; that is what produces those
funky URLs you often see, e.g. (this is totally made up)
  http://cgi.debian.org/cgi-bin/search?package=apache&arch=i386&version=stable
Of course, if the result of a CGI script is a static file, then it
makes sense to redirect to that file so the actual URL is displayed.
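A stripped-down sketch of both cases, in Python (the script name, parameter
and paths are all invented for the example):

  #!/usr/bin/python3
  # If the answer already exists as a static page, send a Location header so
  # the browser ends up at the real URL; otherwise build the page on the fly
  # and it stays under the funky cgi-bin URL.
  import os, sys, html
  from urllib.parse import parse_qs

  params = parse_qs(os.environ.get("QUERY_STRING", ""))
  package = params.get("package", [""])[0]
  static = "/srv/www/packages/%s.html" % package      # hypothetical static copy

  if package and os.path.exists(static):
      sys.stdout.write("Location: http://www.debian.org/packages/%s.html\r\n\r\n" % package)
  else:
      sys.stdout.write("Content-Type: text/html\r\n\r\n")
      sys.stdout.write("<p>Results for %s ...</p>\n" % html.escape(package))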

> The users never see the 10 MB monthly files, do
> they?
No. We extract the relevant section of the big file, wrap it in HTML, and
return it to the user. It would clearly be faster to break up the file
beforehand, but maintaining all those little files is a pain. It's
a typical size (number of files, in this case) versus speed tradeoff.

> Assuming the html files are what we want to index, we should
> just index them with no fancy footwork to save the open(2) system
> calls.  Email archives are index-once-and-for-all things, especially
> when mkid can build incrementally.  The development time of teaching
> the scanner about how to scan a large file but make index entries as
> though it had scanned the html files doesn't seem worth the trouble.
> 
There is no reason to pretend that the mail is HTML. HTML is not something
we care about from a searching perspective. In fact, everything inside of
HTML tags should be ignored (except for meta tags). Note that being inside
a tag (between '<' and '>') is different from being between an opening
and closing tag, e.g. <p>This is a paragraph</p>. Craig is going to have
fun with this, because some container tags don't require a closing
tag (<p> is a good example).
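
Something along these lines might do (a quick Python sketch, not necessarily
what Craig will end up with; the forgiving parser in the standard library
happens to cope with unclosed container tags):

  from html.parser import HTMLParser

  # Keep the text between tags and the content of <meta> tags; ignore
  # everything else.  Unclosed <p> tags are handled for free because we only
  # ever look at the character data, never at the tag structure.
  class TextOnly(HTMLParser):
      def __init__(self):
          super().__init__()
          self.chunks = []

      def handle_starttag(self, tag, attrs):
          if tag == "meta":
              attrs = dict(attrs)
              if "content" in attrs:
                  self.chunks.append(attrs["content"])

      def handle_data(self, data):
          self.chunks.append(data)

  p = TextOnly()
  p.feed('<meta name="keywords" content="debian search">'
         '<p>This is a paragraph<p>And another one, never closed.')
  print(" ".join(p.chunks))
  # debian search This is a paragraph And another one, never closed.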

Since I started this, I'll finish the thought on using the large files
instead of many small ones, and then we can drop it. The more I think about
it, the more complicated it gets. The problem is that we'd have to keep
track of two pieces of information to return a mail instead of just one.

When searching within a given month, a hit within a given piece of mail
would return the starting location (offset from the beginning of the file)
for that mail. Thus, each mail is uniquely specified by file and offset.
Displaying that piece of mail should then be relatively fast: print the HTML
header, open the file, seek to the offset, read the mail headers,
mark up the relevant headers and print them (ignoring the others), read and
print the body of the mail, print the HTML footer, and close the file.
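
In sketch form (Python; the archive path, which headers get kept, and the
markup are all made up for the example):

  import html

  KEEP = ("From", "Date", "Subject")

  def show_mail(path, offset):
      out = ["<html><body><pre>"]                 # print HTML header
      with open(path, "rb") as f:
          f.seek(offset)                          # move to the offset
          # the mail headers run until the first blank line
          for raw in f:
              line = raw.decode("latin-1").rstrip("\r\n")
              if not line:
                  break
              if line.split(":", 1)[0] in KEEP:   # mark up relevant headers
                  out.append("<b>%s</b>" % html.escape(line))
          # the body runs until the next "From " separator or end of file
          for raw in f:
              line = raw.decode("latin-1")
              if line.startswith("From "):
                  break
              out.append(html.escape(line.rstrip("\r\n")))
      out.append("</pre></body></html>")          # print HTML footer
      return "\n".join(out)

  print(show_mail("/srv/lists/debian-www.200003", 0))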

Drat. I forgot that the result page for a search should include the sender,
date and subject for each mail. More complications.

On the off chance that this is doable and worth pursuing, it would be
even better if we could compress each file and work from those.
Picky, aren't I? :)
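
For what it's worth, Python's gzip module will even let you seek inside a
compressed archive, so the sketch above works on a .gz file in principle.
It just fakes the seek by decompressing everything before the offset, so it
gets slower the further into the month the mail sits, and the offsets have
to refer to the uncompressed stream:

  import gzip

  offset = 123456        # made-up offset into the *uncompressed* stream

  with gzip.open("/srv/lists/debian-www.200003.gz", "rb") as f:   # made-up path
      f.seek(offset)     # works, but decompresses everything up to here
      print(f.readline())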

-- 
James (Jay) Treacy
treacy@debian.org

