On Tue, Jan 17, 2006 at 02:05:27PM +0100, Adeodato Simó wrote:
> Aieeee, any chance of getting a copy of msgid.php et al. so that
> somebody can run it elsewhere?

Here. It still needs some work - the php frontend does not handle duplicate msgids (which exist) because writing php makes me want to vomit, and update-index is too slow. The biggest problem is that it's looking in every 'month' directory (like debian-devel/2005/debian-devel-200510) instead of just the current one, so it stats thousands of files every time, which makes DSA bitch about disk IO time and cache consumed on master.

It can't run anywhere else, it needs a copy of the *actual* HTML archives - the process used to generate them (and therefore the URLs to the mails) is non-deterministic, so you have to process the results, and they aren't mirrored anywhere.

Solving the month-directory problem isn't hugely difficult but it is subtle: you have to record the last month you looked at, so that you can check no new mails have been added since then.

Originally I had it running every 5 minutes, but I reduced it to every 30 to get neuro off my back - it takes about 20 to 30 seconds for each sweep of the lists, at those intervals, most of which is spent waiting for the kernel to come back from stat() calls (because master's disks are usually busy). It would make sense to run rapid sweeps over -devel and other high-traffic lists, and less frequent ones over the rest, but I never got around to that either.

There's a subtle correctness issue in that it fails to notice when listmasters delete spam from the archives, and in doing so change all the URLs to mails after that point. I'm not sure what to do about that; the root problem is that what the listmasters are doing is crazy.

Oh, and it's fucking ugly. I meant to rewrite it ages ago. I threw the thing together in an hour or two. Conceptually it's simple but subtle. Except for the php bit, which is a blunt instrument in homage to the fact that master supports php but not perl.
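A minimal sketch of the "record the last month you looked at" idea, in Python rather than whatever the attached code uses. It assumes the directory layout from the example above (`<list>/<year>/<list>-<YYYYMM>`); the function name `months_to_scan` and the `YYYYMM` string used as the recorded state are hypothetical, not taken from the actual tool.

```python
import os
import re


def months_to_scan(list_root, list_name, last_month):
    """Return (month, path) pairs for month directories at or after last_month.

    last_month is a 'YYYYMM' string recorded by the previous sweep, or None
    for a first run (in which case every month is returned).
    """
    # Assumed naming scheme: e.g. debian-devel-200510
    pattern = re.compile(re.escape(list_name) + r"-(\d{6})$")
    found = []
    for year in sorted(os.listdir(list_root)):
        year_dir = os.path.join(list_root, year)
        if not (year.isdigit() and os.path.isdir(year_dir)):
            continue
        # Skip whole years older than the recorded month without descending.
        if last_month is not None and year < last_month[:4]:
            continue
        for entry in sorted(os.listdir(year_dir)):
            m = pattern.match(entry)
            if m and (last_month is None or m.group(1) >= last_month):
                found.append((m.group(1), os.path.join(year_dir, entry)))
    return sorted(found)
```

The recorded month itself is rescanned on purpose, since new mails may have arrived in it after the previous sweep; only once a later month has been seen can it be skipped. This does nothing about the spam-deletion problem, which renumbers mails in months already passed.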
(Initially building the index database takes something like 10 hours, running at nice +20, and that's got to be on master too. I seem to have accidentally killed off all my copies of it, thought I still had one, oh well)

-- 
Andrew Suffield
Attachment:
mindx.tar.gz
Description: Binary data