On Sun, May 09, 2004 at 11:57:17PM -0400, Duncan Findlay scribbled:
> On Mon, May 10, 2004 at 04:09:33AM +0200, Marek Habersack wrote:
> > On Sun, May 09, 2004 at 06:44:36PM +0200, Eike zyro Sauer scribbled:
> > > Andrew Lau wrote:
> > > > Has debian.org's Spamassassin Bayesian database been poisoned? If so,
> > > > would flushing the database at random intervals be enough to keep its
> > > > usefulness feasible or would it just let too much spam in after each flush?
> > >
> > > I'd "donate" 6000 spam mails, if this helps.
> > I could add my 14845 spams, too :)
>
> Pfff... you can have my 63,286 spams if you really want, but it won't
> really help you. The thing with a Bayesian database is that the mail
> it's trained on needs to be similar to the mail it will be tested
> against.
Most of my spam comes from the debian lists, so I would say it is similar
enough to the traffic down here.

> For what it's worth, empirical evidence indicates that SpamAssassin's
> Bayesian database is difficult to poison, since it's difficult for
> spammers to pick words that are learned as non-spammy (since everyone
> has their own set of non-spammy words). But, since lists.debian.org
> doesn't use bayes, this point is moot.
I don't understand why SpamAssassin is thought to be the only option. SA is
a CPU/memory hog that can easily kill even a fairly powerful machine, and
there _are_ alternatives to it. One thing to use could be dspam, as I
pointed out in the other post; another (which also uses language
classification and is already packaged for debian) would be crm114; and then
there is a whole host of bayesian filter programs that are written in a
language suited for heavy-duty tasks (C, that is :>). Both dspam and crm114
boast over 99% accuracy in spotting spam; it would be really neat if we had
that level of protection around here.

regards,

marek
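
For illustration, the poisoning argument quoted above can be sketched with a toy naive Bayes scorer. This is only a minimal sketch under stated assumptions: it is not SpamAssassin's, dspam's or crm114's actual algorithm, the function names (train, spam_log_odds) are hypothetical, and the tiny corpora are invented. It shows that padding a spam with guessed "innocent" words changes nothing unless those words happen to be in the recipient's own ham vocabulary.

```python
import math
from collections import Counter

def train(messages):
    """Count, per token, how many training messages contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts, len(messages)

def spam_log_odds(text, spam, ham):
    """Sum per-token log-odds of spam vs. ham, with add-one smoothing.

    Positive means "looks like spam"; tokens unseen in both corpora
    contribute zero when the corpora are the same size.
    """
    (spam_counts, n_spam), (ham_counts, n_ham) = spam, ham
    score = 0.0
    for tok in set(text.lower().split()):
        p_spam = (spam_counts[tok] + 1) / (n_spam + 2)
        p_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score

# Toy corpora, invented for illustration only.
spam = train(["cheap pills online", "cheap loans online now", "win money now"])
ham = train(["dpkg upload rejected", "kernel package upload", "bug in dpkg postinst"])

plain = "cheap pills now"
guessed = plain + " weather garden recipe"  # words this recipient never uses
targeted = plain + " dpkg upload kernel"    # words from this recipient's ham

score_plain = spam_log_odds(plain, spam, ham)
score_guessed = spam_log_odds(guessed, spam, ham)
score_targeted = spam_log_odds(targeted, spam, ham)

print(f"plain={score_plain:.2f} guessed={score_guessed:.2f} "
      f"targeted={score_targeted:.2f}")
```

Only the "targeted" padding lowers the score, and a spammer would have to know each recipient's ham vocabulary to pick such words. The same mechanism is why donated spam helps only if it resembles the mail the filter will actually see.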