Re: Search
- To: Greg McGary <gkm@eng.ascend.com>
- Cc: debian-www@lists.debian.org, brad@lachman.com
- Subject: Re: Search
- From: "James A. Treacy" <treacy@debian.org>
- Date: Tue, 21 Mar 2000 00:10:07 -0500
- Message-id: <20000321001007.F29814@landru.home.link>
- In-reply-to: <msd7opfq7b.fsf@gkm-dsl-194.ascend.com>; from gkm@eng.ascend.com on Mon, Mar 20, 2000 at 11:31:52AM -0700
- References: <20000320141747.CE4B597A3@scooter.eye-net.com.au> <msog8af9qw.fsf@gkm-dsl-194.ascend.com> <20000320093611.E23208@landru.home.link> <msk8ixfx56.fsf@gkm-dsl-194.ascend.com> <20000320120513.A24993@landru.home.link> <msd7opfq7b.fsf@gkm-dsl-194.ascend.com>
On Mon, Mar 20, 2000 at 11:31:52AM -0700, Greg McGary wrote:
> > Files are of the form foo.lang.html, e.g. index.en.html.
>
> OK. That makes it very easy. What's the complete list of languages,
> and what charset encoding is used for each? I'm a lowly mono-lingual
> ugly American, but I have a brain-trust of i18n pros, so I'll get them
> to help me figure out how best to code language-specific scanners.
>
Is it really necessary to know the charset used on the page? As long
as searches are 8-bit clean, I would think that it wouldn't make a
difference.
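
A minimal sketch of what "8-bit clean" searching could look like, assuming the engine compares raw bytes and never decodes the page. The function name `find_bytes` and the sample text are hypothetical. It also shows the one case where the charset still matters: the query has to arrive in the same encoding as the page.

```python
def find_bytes(page: bytes, query: bytes) -> int:
    """Return the byte offset of query in page, or -1. No decoding is done."""
    return page.find(query)


# Hypothetical sample: the same sentence stored in two different charsets.
# "\u00e9" is the letter e with an acute accent.
text = "Un caf\u00e9 noir"
word = "caf\u00e9"

latin1_page = text.encode("latin-1")
utf8_page = text.encode("utf-8")

# Query bytes in the page's own encoding: found in both cases.
hit_latin1 = find_bytes(latin1_page, word.encode("latin-1"))
hit_utf8 = find_bytes(utf8_page, word.encode("utf-8"))

# Query bytes in a different encoding from the page: not found.
miss = find_bytes(utf8_page, word.encode("latin-1"))

print(hit_latin1, hit_utf8, miss)
```

So a byte-wise search works without charset knowledge as long as the browser sends the query in the encoding of the pages being searched.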
> I should wget a representative sample of your site for use as test
> data. Is there a subtree that has some of every language on the site?
>
The most commonly translated page is the main page: http://www.debian.org/ .
At the bottom are links to all the translations. If you start with
the English version, save it as index.en.html .
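
Since the language is encoded in the file name (foo.lang.html), a scanner could pick it up without looking inside the file. A rough sketch, assuming language codes are two lowercase letters with an optional two-letter region suffix; that pattern and the name `split_lang` are assumptions, not something confirmed in this thread.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <base>.<lang>.html, where lang is e.g. "en" or "pt-br".
LANG_RE = re.compile(r"^(?P<base>.+)\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?)\.html$")


def split_lang(filename: str) -> Optional[Tuple[str, str]]:
    """Return (base, lang) for names like index.en.html, else None."""
    match = LANG_RE.match(filename)
    if match is None:
        return None
    return match.group("base"), match.group("lang")


print(split_lang("index.en.html"))
print(split_lang("index.html"))
```

A file with no language component returns None, so the indexer can decide separately how to treat it.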
[snip]
> > Is this doable?
>
> Definitely. What query API must I provide for Apache? Just point me
> to the documentation.
>
> > <meta name="Keywords" content="debian, main, stable, size:88.3 apache">
>
> So, the above sample is for the package "apache", and the indexable
> key/value pair is labelled meta name/content, right?
>
> > This makes it easy for us to restrict the search to packages by
> > distribution (main, non-free or contrib), release (stable, unstable or
> > frozen) and package name (or substrings of the name).
>
> OK, that's another area that needs work.
>
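
For reference, the sample Keywords value quoted above could be split into searchable fields roughly like this. The field order (site, distribution, release, then "size:<number> <package>") is inferred from that single sample and is an assumption, as are the function and key names.

```python
from typing import Dict, Union


def parse_keywords(content: str) -> Dict[str, Union[str, float]]:
    """Split a Keywords content string into named fields.

    Assumed layout, taken from one sample only:
        debian, <distribution>, <release>, size:<number> <package>
    """
    fields = [field.strip() for field in content.split(",")]
    if len(fields) != 4:
        raise ValueError("unexpected Keywords layout: %r" % content)
    site, distribution, release, tail = fields

    # The last field holds both the size and the package name.
    size_part, package = tail.split()
    label, _, number = size_part.partition(":")
    if label != "size":
        raise ValueError("expected size:<number>, got %r" % size_part)

    return {
        "site": site,
        "distribution": distribution,
        "release": release,
        "size": float(number),  # units are not stated in the sample
        "package": package,
    }


sample = "debian, main, stable, size:88.3 apache"
info = parse_keywords(sample)
print(info)
```

With fields like these, restricting a search to, say, distribution "main" and release "stable" becomes a plain equality test, and a package-name search a substring test on `info["package"]`.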
Unless you are interested in creating a general purpose cgi frontend
it is probably better if you work on the searching/indexing and
specialized parsers while we create the cgi interface.
In a separate mail you asked for help in creating the html parser
(I hope my terminology is correct). I'd love to help, but am already
overextended. :(
With respect to parsers, do you have a suggestion on the best way to
handle the list archives? We currently generate a single file for each
list for each month (in standard mail format - some as big as 10MB).
Each file is then broken up into a directory containing one htmlized
file for each piece of mail. This generates a LOT of files. Do you
think it would be practical (from a speed point of view) to work
directly from the big files and extract the relevant mails on the fly?
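
To make the question concrete, here is a rough sketch of the "on the fly" approach, assuming the monthly files are mbox-style (each mail starts with a line beginning "From ", and body lines that begin that way are escaped). The indexer would scan each big file once, remember the byte span of every mail, and later seek straight to a span. All names and the sample data are hypothetical.

```python
import io
from typing import BinaryIO, List, Tuple

Span = Tuple[int, int]  # (start offset, end offset) in bytes


def index_mbox(f: BinaryIO) -> List[Span]:
    """Scan the file once and return the byte span of each message."""
    spans = []
    pos = 0
    start = None
    for line in f:
        if line.startswith(b"From "):
            if start is not None:
                spans.append((start, pos))
            start = pos
        pos += len(line)
    if start is not None:
        spans.append((start, pos))
    return spans


def read_message(f: BinaryIO, span: Span) -> bytes:
    """Fetch one message by seeking to its span; nothing else is read."""
    start, end = span
    f.seek(start)
    return f.read(end - start)


# Hypothetical two-message archive held in memory for the demonstration.
sample = (
    b"From a@example.org Mon Mar 20 00:00:00 2000\n"
    b"Subject: one\n\nbody one\n\n"
    b"From b@example.org Mon Mar 20 00:01:00 2000\n"
    b"Subject: two\n\nbody two\n"
)
archive = io.BytesIO(sample)
spans = index_mbox(archive)
second = read_message(archive, spans[1])
print(len(spans), second.splitlines()[1])
```

The one-time scan costs the same as reading the file, but each later lookup is a single seek and a read of one mail, regardless of how large the monthly file is.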
--
James (Jay) Treacy
treacy@debian.org