
lurker + debian

As most of you are aware, I have for the last two years been writing a
mailing list archiver called 'lurker' (.sf.net). Recently, all the details
came together in a nicely unified whole which seems quite stable. As a
debian nut, I am highly interested in seeing lurker used on the debian lists
and have set most of my design requirements to meet this goal.

To summarize why lurker is good for debian:

	it scales to the volume of debian email with full-text search
	it supports multiple character sets in headers + body
	the threading is much more useful (imo)
	cross-posting is understood and works with threading
	attachments such as patches and signatures are treated correctly
	the debian archives were its testcase data

The biggest problem deploying it is that debian has a lot of mail.
The archive is so large that the lurker database will probably exceed
2GB, which means that the system running lurker must have large file
support (LFS).

The other issue is that lurker must have all of the mail for a single
mailing list in one mailbox. This may cause problems since debian people
like to use mutt on the archived mailboxes and such a mbox would be far 
too large.

Let me briefly outline why lurker has one mailbox per list:

1. it makes automatic database upgrades possible after format changes
   (lurker can simply regenerate its database using the mailboxes in its
    database dir without having to flounder about asking for help)
2. it keeps the option of opening all the mailboxes at once available
   (prior versions of lurker did not have enough file descriptors for some
    really large mailing lists)
3. it makes lurker-index have the fire-and-forget property: simply piping a
   message into lurker-index is sufficient to have it looked after. you
   don't need to worry about keeping the source mailbox in the same location,
   about someone touching it with an editor, or even about keeping it at all
4. it keeps users from poking the mailbox when it is "part of the database"
5. it is much simpler for the lurker code -> more robust

Of these I think 3 is the most important. Versions of lurker prior to v0.5
required the administrator to list each mailbox which comprised a mailing
list in lurker's config file. If there were new monthly mailboxes, the
config file needed to be updated. These mailboxes could then not be moved or
modified in any way other than append. Invariably, something went wrong.
Furthermore, because new messages were not fed via a push script like
lurker-index, a daemon was needed to monitor all the mailboxes. This, of
course, required they all be opened!

The current solution, otoh, is far simpler. You simply take a single
message or mailbox and pipe it through lurker-index, saying which mailing
list it is for. Then you never have to worry about it again.
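
To make the push model concrete, here is a minimal sketch in python. It is
not lurker's code and does not use lurker-index's real interface; the
function name push_message, the "<list>.mbox" file naming and the db_dir
layout are all assumptions made for illustration. It only shows the idea
that each incoming message is appended to the one mailbox kept per list, so
the source of the message can be forgotten afterwards.

```python
import mailbox
import os


def push_message(db_dir, list_name, raw_bytes):
    """Append one raw message to the single mbox kept for list_name.

    Hypothetical helper, not part of lurker. Returns the number of
    messages now stored for that list.
    """
    # Assumed layout: one file per list inside the database directory.
    path = os.path.join(db_dir, list_name + ".mbox")
    box = mailbox.mbox(path)  # the file is created on first use
    box.lock()                # guard against concurrent deliveries
    try:
        box.add(raw_bytes)    # append only, never rewrite old messages
        box.flush()
        return len(box)
    finally:
        box.close()           # flushes again and releases the lock
```

A delivery script would read the message from stdin and call something like
push_message("/var/lib/archive", "debian-user", data), where both arguments
are again made-up examples.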

For these reasons I consider the new scheme to be superior from a usability
and robustness standpoint. I know that it takes more disk space, which is
why I only switched to it after a lot of deliberation. I just had to really
convince myself that "disk is cheap". Besides, for normal users of lurker,
the mailbox does not need to be mutt accessible, so there is no need to keep
another copy of the mailbox. And if it were mutt accessible, you would have
to be absolutely certain mutt didn't change the Status: flag!

I have ideas for deploying lurker, but will keep my mouth shut unless asked,
as I don't want to step on any administrator's toes. I will mention, however,
that the current debian interface can be preserved with lurker.


The pages like http://lists.debian.org/users.html can be built as static
html with a per-month perl cron job. This is because lurker message index
urls are keyed by date, so one can readily hard-code a url which jumps to
the current time.
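
As a rough sketch of what such a cron job would compute (in python here
rather than perl): the base url, the path layout and the timestamp format
below are invented for the example and are not lurker's actual url scheme.
The point is only that a date-keyed url can be generated from the calendar
alone, without consulting the database.

```python
from datetime import datetime, timezone

# Hypothetical url layout; lurker's real scheme may differ.
BASE = "http://lists.example.org/lurker/list"


def month_index_url(list_name, year, month):
    """Build a url that jumps to the first instant of the given month."""
    stamp = datetime(year, month, 1, tzinfo=timezone.utc)
    return f"{BASE}/{list_name}/{stamp.strftime('%Y%m%d.%H%M%S')}.html"


def month_links(list_name, start, end):
    """Yield (label, url) for every month from start to end inclusive.

    start and end are (year, month) tuples.
    """
    year, month = start
    while (year, month) <= end:
        yield f"{year}-{month:02d}", month_index_url(list_name, year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

The static page is then just these links written out as html, regenerated
once a month so that the newest month appears.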

The page http://lists.debian.org/search.html can still be static html. Now
it just submits to keyword.cgi. However, lurker searches operate differently
from glimpse, since lurker uses a reverse-index of keywords rather than
grepping the text. This means that partial matches, misspellings, and
regexps cannot be supported. Otoh, the max messages returned and date
fields are mostly irrelevant, since lurker returns results centered around a
specified date. The search may then be refined at that position in time, or
you can move through time: backwards, forwards, or by jumping. Finally,
entries for specific search terms can be added: author, subject, thread,
reply-to, message-id.
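
A toy illustration of why this is so, as a python sketch. It is not how
lurker stores its index; the TinyIndex class and its methods are made up for
the example. Each whole word maps to a date-sorted list of messages, so an
exact keyword lookup is cheap and results can be cut out around any date,
but a fragment such as "kern" simply has no entry to look up.

```python
import bisect
import re
from collections import defaultdict


class TinyIndex:
    """Toy keyword index: term -> sorted list of (timestamp, msg_id)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, msg_id, timestamp, text):
        # Index each distinct whole word once per message.
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            bisect.insort(self.postings[term], (timestamp, msg_id))

    def search(self, terms, around, window=2):
        """Return messages containing every term, nearest to `around`."""
        lists = [self.postings.get(t.lower(), []) for t in terms]
        if not lists or not all(lists):
            return []  # some term was never indexed as a whole word
        hits = sorted(set(lists[0]).intersection(*lists[1:]))
        # Find the position of the requested date, then take a window
        # of results on either side of it.
        pos = bisect.bisect_left(hits, (around, ""))
        return hits[max(0, pos - window):pos + window]
```

Moving backwards or forwards through time then amounts to calling search
again with a different `around` value.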

For lurker-generated content, colour changes and so forth can be done with
the style-sheet. More structural changes can be done by tweaking the xslt
used to render html. I will make any specific UI changes required to adapt
lurker's appearance to match the debian site, although this could be done by
the webmaster if they are familiar with xslt.

Thanks for your time!

Wesley W. Terpstra <wesley@terpstra.ca>
