
Re: msgid.php

On Tue, Jan 17, 2006 at 02:05:27PM +0100, Adeodato Simó wrote:
> Aieeee, any chance of getting a copy of msgid.php et al. so that
> somebody can run it elsewhere?

Here. It still needs some work - the php frontend does not handle
duplicate msgids (which exist) because writing php makes me want to
vomit, and update-index is too slow. The biggest problem is that it's
looking in every 'month' directory (like
debian-devel/2005/debian-devel-200510) instead of just the current
one, so it stats thousands of files every time, which makes DSA bitch
about disk IO time and cache consumed on master. It can't run anywhere
else: it needs a copy of the *actual* HTML archives - the process used
to generate them (and therefore the URLs to the mails) is
non-deterministic, so you have to process the results, and they aren't
mirrored anywhere.

Solving this isn't hugely difficult but it is subtle: you have to
record the last month you looked at, so that you can check no new
mails have been added since then.
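
A minimal sketch of that bookkeeping, in Python and not taken from the
attached tarball. It assumes the layout is <list>/<year>/<list>-<YYYYMM>/
with one msgNNNNN.html file per mail; the file name pattern and the
function names month_dirs and sweep are assumptions for illustration.

```python
import os
import re

# Assumed layout: <list_root>/<year>/<list>-<YYYYMM>/msgNNNNN.html
MONTH_RE = re.compile(r"-(\d{6})$")
MSG_RE = re.compile(r"^msg\d+\.html$")


def month_dirs(list_root):
    """Return (yyyymm, path) pairs for every month directory, oldest first."""
    found = []
    for year in os.listdir(list_root):
        year_path = os.path.join(list_root, year)
        if not os.path.isdir(year_path):
            continue
        for name in os.listdir(year_path):
            m = MONTH_RE.search(name)
            if m:
                found.append((m.group(1), os.path.join(year_path, name)))
    return sorted(found)


def sweep(list_root, state):
    """Return (paths of unseen mails, updated state).

    state is None on the first run, otherwise (last_month, seen), where
    seen is the set of file names already indexed in last_month.
    """
    last_month, seen = state if state else ("", set())
    new_files = []
    for yyyymm, path in month_dirs(list_root):
        if yyyymm < last_month:
            continue  # older than the recorded month: skip without listing it
        # The recorded month is re-listed: mails may have arrived in it
        # after the previous sweep, before the month rolled over.
        already = seen if yyyymm == last_month else set()
        names = {n for n in os.listdir(path) if MSG_RE.match(n)}
        new_files.extend(os.path.join(path, n) for n in sorted(names - already))
        last_month, seen = yyyymm, names
    return new_files, (last_month, seen)
```

Only the recorded month and anything newer get listed, so the sweep
no longer touches the files in every old month directory.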

Originally I had it running every 5 minutes, but I reduced it to every
30 to get neuro off my back - at those intervals each sweep of the
lists takes about 20 to 30 seconds, most of which is spent waiting
for the kernel to come back from stat() calls (because master's disks
are usually busy). It would make sense to run rapid sweeps over -devel
and other high-traffic lists, and less frequent ones over the rest,
but I never got around to that either.
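
The tiered schedule could be sketched roughly like this, in Python. It
is not something the attached code does; the function name due_lists
and the per-list interval table are hypothetical, with the 5 and 30
minute figures borrowed from the intervals mentioned above.

```python
def due_lists(intervals, last_run, now):
    """Return the names of lists whose interval has elapsed.

    intervals: list name -> seconds between sweeps (hypothetical table)
    last_run:  list name -> time of the last sweep; missing means never
    now:       current time, in the same units as last_run
    """
    return sorted(
        name
        for name, interval in intervals.items()
        if now - last_run.get(name, float("-inf")) >= interval
    )


# Example table: sweep -devel every 5 minutes, a quieter list every 30.
INTERVALS = {"debian-devel": 5 * 60, "debian-legal": 30 * 60}
```

A cron job firing every 5 minutes would call due_lists, sweep only the
lists it returns, and then record the current time for each of them.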

There's a subtle correctness issue in that it fails to notice when
listmasters delete spam from the archives, and in doing so change all
the URLs to mails after that point. I'm not sure what to do about
that; the root problem is that what the listmasters are doing is

Oh, and it's fucking ugly. I meant to rewrite it ages ago. I threw the
thing together in an hour or two. Conceptually it's simple but subtle.
Except for the php bit, which is a blunt instrument in homage to the
fact that master supports php but not perl.

(Initially building the index database takes something like 10 hours,
running at nice +20, and that's got to be on master too. I seem to
have accidentally killed off all my copies of it, thought I still had
one, oh well)

Andrew Suffield

Attachment: mindx.tar.gz
Description: Binary data

Attachment: signature.asc
Description: Digital signature
