
Re: Search



On Mon, Mar 20, 2000 at 11:31:52AM -0700, Greg McGary wrote:

> > Files are of the form foo.lang.html, e.g. index.en.html.
> 
> OK.  That makes it very easy.  What's the complete list of languages,
> and what charset encoding is used for each?  I'm a lowly mono-lingual
> ugly American, but I have a brain-trust of i18n pros, so I'll get them
> to help me figure out how best to code language-specific scanners.
> 
Is it really necessary to know the charset used on the page? As long
as searches are 8-bit clean, I would think that it wouldn't make a
difference.
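A small sketch of why the charset may still matter, even with an 8-bit-clean search: the same character is a different byte sequence in different encodings, so a byte-level substring match only succeeds when the query and the page happen to agree on the charset. The word and page text here are made up for illustration.

```python
# Illustrative only: byte-level matching across mismatched charsets.
word = "café"

latin1_query = word.encode("latin-1")   # b'caf\xe9'
utf8_query = word.encode("utf-8")       # b'caf\xc3\xa9'

# A page stored in Latin-1, searched with a raw byte substring test
# (the 8-bit-clean case):
page = ("menu: " + word).encode("latin-1")
print(latin1_query in page)  # True  - query and page agree on charset
print(utf8_query in page)    # False - same word, different encoding
```

So 8-bit-clean searching works within one encoding, but matching the same word across pages in different charsets would need the scanner to know each page's encoding.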

> I should wget a representative sample of your site for use as test
> data.  Is there a subtree that has some of every language on the site?
> 
The most commonly translated page is the main page: http://www.debian.org/ .
At the bottom are links to all the translations. If you start with
the English version, save it as index.en.html .

[snip]
> > Is this doable?
> 
> Definitely.  What query API must I provide for Apache?  Just point me
> to the documentation.
> 
> >   <meta name="Keywords" content="debian, main, stable, size:88.3 apache">
> 
> So, the above sample is for the package "apache", and the indexable
> key/value pair is labelled meta name/content, right?
> 
> > This makes it easy for us to restrict the search to packages by
> > distribution (main, non-free or contrib), release (stable, unstable or
> > frozen) and package name (or substrings of the name).
> 
> OK, that's another area that needs work.  
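To make the meta name/content pairing concrete, here is a hypothetical sketch of pulling the indexable fields out of the Keywords tag shown above. The field layout (distribution, release, size:NN, package name) is inferred from the one sample line, so the parsing rules are an assumption, not a spec.

```python
# Hypothetical parser for the sample tag:
#   <meta name="Keywords" content="debian, main, stable, size:88.3 apache">
# Field order is assumed from that single example.
from html.parser import HTMLParser

class KeywordMeta(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "keywords":
            parts = [p.strip() for p in a["content"].split(",")]
            # Last field holds "size:NN.N packagename" in the sample.
            size, _, package = parts[-1].partition(" ")
            self.fields = {
                "distribution": parts[1],        # main, non-free or contrib
                "release": parts[2],             # stable, unstable or frozen
                "size": size.split(":", 1)[1],
                "package": package,
            }

page = '<meta name="Keywords" content="debian, main, stable, size:88.3 apache">'
p = KeywordMeta()
p.feed(page)
print(p.fields)
# {'distribution': 'main', 'release': 'stable', 'size': '88.3', 'package': 'apache'}
```

With the fields split out like this, restricting a search by distribution, release, or package-name substring becomes a simple filter over the parsed values.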
> 
Unless you are interested in creating a general-purpose CGI frontend,
it is probably better if you work on the searching/indexing and
specialized parsers while we create the CGI interface.

In a separate mail you asked for help in creating the HTML parser
(I hope my terminology is correct). I'd love to help, but am already
overextended. :(

With respect to parsers, do you have a suggestion on the best way to
handle the list archives? We currently generate a single file for each
list for each month (in standard mail format - some as big as 10MB).
Each file is then broken up into a directory containing one htmlized
file for each piece of mail. This generates a LOT of files. Do you
think it would be practical (from a speed point of view) to work
directly from the big files and extract the relevant mails on the fly?
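One way the "work directly from the big files" idea could look: a single pass over the monthly mbox file records the byte offset of each `From ` separator, after which any individual mail can be served with a seek and a bounded read, with no per-message files on disk. This is only a sketch of the approach; the sample messages and function names are made up.

```python
# Sketch: index a large mbox once, then extract single mails by offset.
import io

def index_mbox(f):
    """One pass over an mbox; returns (start, end) byte offsets per message."""
    offsets, start, pos = [], None, f.tell()
    for line in f:
        if line.startswith(b"From "):     # message separator in mbox format
            if start is not None:
                offsets.append((start, pos))
            start = pos
        pos += len(line)
    if start is not None:
        offsets.append((start, pos))
    return offsets

def read_message(f, span):
    """Seek+read one message; cost is independent of the file's total size."""
    start, end = span
    f.seek(start)
    return f.read(end - start)

# Tiny made-up two-message mbox standing in for a 10 MB monthly archive:
mbox = io.BytesIO(
    b"From alice Mon Mar 20 11:31:52 2000\nSubject: hi\n\nbody one\n"
    b"From bob Mon Mar 20 12:00:00 2000\nSubject: re\n\nbody two\n"
)
toc = index_mbox(mbox)
print(len(toc))                                   # 2
print(read_message(mbox, toc[1]).splitlines()[0]) # b'From bob Mon Mar 20 12:00:00 2000'
```

Speed-wise this seems workable: the indexing pass is sequential I/O done once (or incrementally as mail arrives), and serving a hit is one seek plus one read, which should beat opening one of tens of thousands of small files in most cases.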

-- 
James (Jay) Treacy
treacy@debian.org

