On Mon, May 10, 2004 at 04:09:33AM +0200, Marek Habersack wrote:
> On Sun, May 09, 2004 at 06:44:36PM +0200, Eike zyro Sauer scribbled:
> > Andrew Lau wrote:
> > > Has debian.org's Spamassassin Bayesian database been poisoned? If so,
> > > would flushing the database at random intervals be enough to keep its
> > > usefulness feasible or would it just let too much spam in after each
> > > flush?
> >
> > I'd "donate" 6000 spam mails, if this helps.
> I could add my 14845 spams, too :)

Pfff... you can have my 63,286 spams if you really want, but it won't
really help you. The thing with a Bayesian database is that the mail
it's trained on needs to be similar to the mail it will be tested
against.

For what it's worth, empirical evidence indicates that SpamAssassin's
Bayesian database is difficult to poison: spammers can't easily pick
words that have been learned as non-spammy, because everyone has their
own set of non-spammy words. But since lists.debian.org doesn't use
Bayes, this point is moot.

What is more likely an issue is that the scores are not ideally set for
Debian's needs. I have previously volunteered my assistance to run the
"perceptron" to generate better scores for Debian; however, the problem
seems to be compiling a relatively large corpus of hand-sorted spam and
non-spam from Debian lists.

SpamAssassin's scores are (as of the "soon" to be released version
3.0.0) chosen using a "Stochastic Gradient Descent" method based on
results from running tens or hundreds of thousands of messages through
SpamAssassin. This is an attempt to have results that are okay for
most, but given that Debian has unique characteristics in its mail,
different scores could be generated that would improve results. (For
example, less allowance would be needed for HTML mail, so the scores
for HTML-related rules could be set higher.)

-- 
Duncan Findlay
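
To make the score-fitting idea in the message above concrete, here is a minimal sketch of learning per-rule scores from a hand-sorted corpus. It is not SpamAssassin's actual perceptron tool: it uses a plain mistake-driven update against a fixed threshold, and the rule names, the tiny corpus, and the function names `train_scores` and `classify` are all made up for illustration. The only value taken from SpamAssassin is the default spam threshold of 5.0.

```python
THRESHOLD = 5.0  # SpamAssassin's default required score for "spam"


def train_scores(messages, rule_names, rate=0.1, max_epochs=1000):
    """Fit one score per rule from (rule_hits, is_spam) pairs.

    Illustrative only: a fixed-threshold perceptron-style update,
    not the method SpamAssassin itself ships.
    """
    scores = {name: 0.0 for name in rule_names}
    for _ in range(max_epochs):
        mistakes = 0
        for hits, is_spam in messages:
            total = sum(scores[rule] for rule in hits)
            if (total >= THRESHOLD) == is_spam:
                continue  # already classified correctly
            mistakes += 1
            # Raise the scores of rules that hit a missed spam,
            # lower the scores of rules that hit a misfiled ham.
            step = rate if is_spam else -rate
            for rule in hits:
                scores[rule] += step
        if mistakes == 0:
            break  # every message in the corpus is classified correctly
    return scores


def classify(scores, hits):
    """Return True if the summed rule scores reach the spam threshold."""
    return sum(scores.get(rule, 0.0) for rule in hits) >= THRESHOLD


# Hypothetical list-like corpus: legitimate mail never contains HTML,
# so the HTML rule alone should end up scoring as spam.
RULES = ["HTML_MESSAGE", "DRUG_WORDS", "FORGED_HEADERS", "PGP_SIGNED"]
CORPUS = [
    ({"HTML_MESSAGE"}, True),
    ({"HTML_MESSAGE", "DRUG_WORDS"}, True),
    ({"DRUG_WORDS", "FORGED_HEADERS"}, True),
    ({"PGP_SIGNED"}, False),
    ({"PGP_SIGNED", "FORGED_HEADERS"}, False),
    (set(), False),
    ({"FORGED_HEADERS"}, False),
]

if __name__ == "__main__":
    fitted = train_scores(CORPUS, RULES)
    for rule in RULES:
        print(f"score {rule} {fitted[rule]:.1f}")
```

Because no non-spam message in this toy corpus contains HTML, training pushes the HTML rule's score up to the threshold on its own. A corpus drawn from general-purpose mail, where plenty of legitimate messages are HTML, would hold that score down, which is why scores fitted on Debian's own list traffic could differ from the defaults.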