
Re: Search



On Mon, Mar 20, 2000 at 11:31:52AM -0700, Greg McGary wrote:

> > Files are of the form foo.lang.html, e.g. index.en.html.
> 
> OK.  That makes it very easy.  What's the complete list of languages,
> and what charset encoding is used for each?  I'm a lowly mono-lingual
> ugly American, but I have a brain-trust of i18n pros, so I'll get them
> to help me figure out how best to code language-specific scanners.
> 
Is it really necessary to know the charset used on the page? As long
as searches are 8-bit clean, I would think that it wouldn't make a
difference.
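A small sketch of why the charset may still matter, even with an 8-bit-clean search: the same character is a different byte sequence in different encodings, so a byte-level substring match only succeeds when the query and the page happen to agree on the charset. The word and page text here are made up for illustration.

```python
# Illustrative only: byte-level matching across mismatched charsets.
word = "café"

latin1_query = word.encode("latin-1")   # b'caf\xe9'
utf8_query = word.encode("utf-8")       # b'caf\xc3\xa9'

# A page stored in Latin-1, searched with a raw byte substring test
# (the 8-bit-clean case):
page = ("menu: " + word).encode("latin-1")
print(latin1_query in page)  # True  - query and page agree on charset
print(utf8_query in page)    # False - same word, different encoding
```

So 8-bit-clean searching works within one encoding, but matching the same word across pages in different charsets would need the scanner to know each page's encoding.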

> I should wget a representative sample of your site for use as test
> data.  Is there a subtree that has some of every language on the site?
> 
The most commonly translated page is the main page: http://www.debian.org/ .
At the bottom are links to all the translations. If you start with
the English version, save it as index.en.html .

[snip]
> > Is this doable?
> 
> Definitely.  What query API must I provide for Apache?  Just point me
> to the documentation.
> 
> >   <meta name="Keywords" content="debian, main, stable, size:88.3 apache">
> 
> So, the above sample is for the package "apache", and the indexable
> key/value pair is labelled meta name/content, right?
> 
> > This makes it easy for us to restrict the search to packages by
> > distribution (main, non-free or contrib), release (stable, unstable or
> > frozen) and package name (or substrings of the name).
> 
> OK, that's another area that needs work.  
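To make the meta name/content pairing concrete, here is a hypothetical sketch of pulling the indexable fields out of the Keywords tag shown above. The field layout (distribution, release, size:NN, package name) is inferred from the one sample line, so the parsing rules are an assumption, not a spec.

```python
# Hypothetical parser for the sample tag:
#   <meta name="Keywords" content="debian, main, stable, size:88.3 apache">
# Field order is assumed from that single example.
from html.parser import HTMLParser

class KeywordMeta(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "keywords":
            parts = [p.strip() for p in a["content"].split(",")]
            # Last field holds "size:NN.N packagename" in the sample.
            size, _, package = parts[-1].partition(" ")
            self.fields = {
                "distribution": parts[1],        # main, non-free or contrib
                "release": parts[2],             # stable, unstable or frozen
                "size": size.split(":", 1)[1],
                "package": package,
            }

page = '<meta name="Keywords" content="debian, main, stable, size:88.3 apache">'
p = KeywordMeta()
p.feed(page)
print(p.fields)
# {'distribution': 'main', 'release': 'stable', 'size': '88.3', 'package': 'apache'}
```

With the fields split out like this, restricting a search by distribution, release, or package-name substring becomes a simple filter over the parsed values.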
> 
Unless you are interested in creating a general-purpose CGI frontend,
it is probably better if you work on the searching/indexing and
specialized parsers while we create the CGI interface.

In a separate mail you asked for help in creating the HTML parser
(I hope my terminology is correct). I'd love to help, but am already
overextended. :(

With respect to parsers, do you have a suggestion on the best way to
handle the list archives? We currently generate a single file for each
list for each month (in standard mail format - some as big as 10MB).
Each file is then broken up into a directory containing one htmlized
file for each piece of mail. This generates a LOT of files. Do you
think it would be practical (from a speed point of view) to work
directly from the big files and extract the relevant mails on the fly?
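One way the "work directly from the big files" idea could look: a single pass over the monthly mbox file records the byte offset of each `From ` separator, after which any individual mail can be served with a seek and a bounded read, with no per-message files on disk. This is only a sketch of the approach; the sample messages and function names are made up.

```python
# Sketch: index a large mbox once, then extract single mails by offset.
import io

def index_mbox(f):
    """One pass over an mbox; returns (start, end) byte offsets per message."""
    offsets, start, pos = [], None, f.tell()
    for line in f:
        if line.startswith(b"From "):     # message separator in mbox format
            if start is not None:
                offsets.append((start, pos))
            start = pos
        pos += len(line)
    if start is not None:
        offsets.append((start, pos))
    return offsets

def read_message(f, span):
    """Seek+read one message; cost is independent of the file's total size."""
    start, end = span
    f.seek(start)
    return f.read(end - start)

# Tiny made-up two-message mbox standing in for a 10 MB monthly archive:
mbox = io.BytesIO(
    b"From alice Mon Mar 20 11:31:52 2000\nSubject: hi\n\nbody one\n"
    b"From bob Mon Mar 20 12:00:00 2000\nSubject: re\n\nbody two\n"
)
toc = index_mbox(mbox)
print(len(toc))                                   # 2
print(read_message(mbox, toc[1]).splitlines()[0]) # b'From bob Mon Mar 20 12:00:00 2000'
```

Speed-wise this seems workable: the indexing pass is sequential I/O done once (or incrementally as mail arrives), and serving a hit is one seek plus one read, which should beat opening one of tens of thousands of small files in most cases.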

-- 
James (Jay) Treacy
treacy@debian.org

