Re: Search



"James A. Treacy" <treacy@debian.org> writes:

> On Mon, Mar 20, 2000 at 11:31:52AM -0700, Greg McGary wrote:
> 
> > > Files are of the form foo.lang.html, e.g. index.en.html.
> > 
> > OK.  That makes it very easy.  What's the complete list of languages,
> > and what charset encoding is used for each?  I'm a lowly mono-lingual
> > ugly American, but I have a brain-trust of i18n pros, so I'll get them
> > to help me figure out how best to code language-specific scanners.

> Is it really necessary to know the charset used on the page? As long
> as searches are 8 bit clean I would think that it wouldn't make a
> difference.

The issue is how to delimit tokens.  You need to know the character
classes in order to know when you have transitioned from one class to
another, and therefore ought to end the current token.  You then need
to know which sequences of character classes to keep and which to
toss (keep "words", toss runs of whitespace and punctuation).  I
suppose if the non-word char classes (e.g., whitespace, punctuation)
are consistent across all languages and charsets, then you can treat
everything that's not a non-word as a word and be done with it.  I
don't know enough about the subject to judge.
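
Just to make the character-class idea concrete, here is roughly the
shape a byte-oriented scanner takes.  This is only a sketch, not
mkid's real code: class_of is a hypothetical per-charset table that
would have to be filled in differently for each encoding (and a flat
byte table wouldn't be enough for multi-byte encodings), and
emit_token stands in for whatever would actually record the token.

#include <stdio.h>
#include <stddef.h>

enum char_class { NON_WORD, WORD };

/* Hypothetical per-charset table: one entry per byte value, filled in
   according to the page's encoding.  */
static enum char_class class_of[256];

/* Stand-in for whatever would record the token in the index.  */
static void emit_token (const unsigned char *tok, size_t len)
{
  printf ("%.*s\n", (int) len, tok);
}

/* A token ends whenever we step from a WORD byte to a NON_WORD byte:
   keep the runs of WORD bytes, toss the whitespace and punctuation.  */
static void scan (const unsigned char *p, const unsigned char *end)
{
  const unsigned char *start = 0;

  for (; p < end; p++)
    if (class_of[*p] == WORD)
      {
        if (!start)
          start = p;                    /* entering a word */
      }
    else if (start)
      {
        emit_token (start, p - start);  /* leaving a word */
        start = 0;
      }
  if (start)
    emit_token (start, end - start);
}

int main (void)
{
  int c;

  /* Crude Latin-1-ish classification, purely for illustration.  */
  for (c = 0; c < 256; c++)
    class_of[c] = ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                   || (c >= '0' && c <= '9') || c == '_' || c >= 0xC0)
                  ? WORD : NON_WORD;

  {
    static const unsigned char sample[] = "index.en.html foo_bar, baz";
    scan (sample, sample + sizeof sample - 1);
  }
  return 0;
}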

> Unless you are interested in creating a general purpose cgi frontend
> it is probably better if you work on the searching/indexing and
> specialized parsers while we create the cgi interface.

That's fine by me.  The less I have to do the better. 8^) I'm sure I
can mold the id-utils query interface to be whatever you like.

> In a separate mail you asked for help in creating the html parser
> (I hope my terminology is correct).

I call mkid's token gatherers "scanners" rather than "parsers", in
order to emphasize their simple-minded lexical nature and fast
execution.  Since they must pick out keywords, they do a little
parsing, but it's nothing close to the sophistication or overhead of,
say, a context-free grammar.
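
For the html case, that lexical treatment ought to be enough: one
extra state to skip markup, and the rest is the same word-gathering
loop.  Another illustrative sketch, not a real scanner (it deliberately
ignores entities, comments, script contents, and so on):

#include <stdio.h>
#include <ctype.h>

static char token[1024];
static int len;

static void flush (void)       /* end of a token, if one is pending */
{
  if (len > 0)
    {
      token[len] = '\0';
      puts (token);            /* stand-in for making an index entry */
      len = 0;
    }
}

int main (void)
{
  int c, in_tag = 0;

  while ((c = getchar ()) != EOF)
    if (in_tag)
      in_tag = (c != '>');     /* skip to the end of the markup */
    else if (c == '<')
      {
        in_tag = 1;            /* entering markup delimits a token */
        flush ();
      }
    else if (isalnum (c) || c == '_' || c >= 0x80)
      {
        if (len < (int) sizeof token - 1)
          token[len++] = c;    /* accumulate a word */
      }
    else
      flush ();                /* whitespace/punctuation delimits */

  flush ();
  return 0;
}

The point is that there is no grammar anywhere -- just a couple of
states and a token buffer.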

> I'd love to help, but am already
> overextended. :(

I think Craig is going to give that a go.  We've been discussing it
offline already.  If you can handle the cgi frontend, that's plenty
useful.

> With respect to parsers, do you have a suggestion on the best way to
> handle the list archives? We currently generate a single file for each
> list for each month (in standard mail format - some as big as 10MB).
> Each file is then broken up into a directory containing one htmlized
> file for each piece of mail. This generates a LOT of files. Do you
> think it would be practical (from a speed point of view) to work
> directly from the big files and extract the relevant mails on the fly?

Don't queries need to return the html file names, since that's what
the users will see?  The users never see the 10 MB monthly files, do
they?  Assuming the html files are what we want to index, we should
just index them with no fancy footwork to save the open(2) system
calls.  Email archives are index-once-and-for-all things, especially
when mkid can build incrementally.  The development time of teaching
the scanner to scan a large file but make index entries as
though it had scanned the html files doesn't seem worth the trouble.

Greg

