
Re: Spam in the lists out of control

On Mon, May 10, 2004 at 04:09:33AM +0200, Marek Habersack wrote:
> On Sun, May 09, 2004 at 06:44:36PM +0200, Eike zyro Sauer scribbled:
> > Andrew Lau schrieb:
> > > Has debian.org's Spamassassin Bayesian database been poisoned? If so,
> > > would flushing the database at random intervals be enough to keep its
> > > usefulness feasible or would it just let too much spam in after each flush?
> > 
> > I'd "donate" 6000 spam mails, if this helps.
> I could add my 14845 spams, too :)

Pfff... you can have my 63,286 spams if you really want, but it won't
really help you. The thing with a Bayesian database is that the mail
it's trained on needs to be similar to the mail it will be tested on.

For what it's worth, empirical evidence indicates that SpamAssassin's
Bayesian database is difficult to poison, since it's difficult for
spammers to pick words that are learned as non-spammy (since everyone
has their own set of non-spammy words). But, since lists.debian.org
doesn't use Bayes, this point is moot.
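
To illustrate why a spammer's "safe" words rarely transfer between
recipients, here is a minimal sketch of a per-token spam probability.
It is a toy smoothed estimate, not SpamAssassin's actual Bayes
implementation; the function name, the counts, and the two users are
all made up for the example.

```python
from collections import Counter

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Toy estimate of P(spam | token) with add-one smoothing.

    spam_counts / ham_counts map a token to the number of messages of
    that class containing it. Hypothetical helper, not SpamAssassin code.
    """
    p_tok_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_tok_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_tok_spam / (p_tok_spam + p_tok_ham)

# Made-up corpora: 100 spam and 100 non-spam messages per user.
spam = Counter({"viagra": 80})
ham_user_a = Counter({"dpkg": 60})      # user A mostly discusses packaging
ham_user_b = Counter({"meeting": 60})   # user B mostly gets office mail

# A spammer pads a message with "meeting", hoping it looks non-spammy.
prob_a = token_spam_prob("meeting", spam, ham_user_a, 100, 100)
prob_b = token_spam_prob("meeting", spam, ham_user_b, 100, 100)
print(prob_a, prob_b)
```

In this toy setup the padding word only helps against user B; for
user A it is an unseen token and carries no non-spam evidence.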

A more likely issue is that the scores are not tuned to Debian's
needs. I have previously volunteered to run the "perceptron" to
generate better scores for Debian; however, the obstacle seems to be
compiling a relatively large corpus of hand-sorted spam and non-spam
from Debian lists.

SpamAssassin's scores are (as of the "soon" to be released version
3.0.0) chosen using a "Stochastic Gradient Descent" method based on
results from running tens/hundreds of thousands of messages through
SpamAssassin. This is an attempt to get results that are okay for
most users, but given that Debian has unique characteristics in its
mail, scores generated from Debian's own mail could improve results.
(For example, less allowance would be needed for HTML mail, so the
score for HTML rules could be set higher.)
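
As a rough sketch of what fitting scores to a hand-sorted corpus
looks like, the following trains one weight per rule with stochastic
gradient descent on a logistic loss. This is not the actual
SpamAssassin perceptron; the rule names, the tiny corpus, and the
learning rate are assumptions made up for illustration.

```python
import math
import random

def train_scores(messages, n_rules, epochs=200, lr=0.1, seed=0):
    """Fit one score per rule by stochastic gradient descent.

    messages: list of (hits, is_spam) pairs, where hits is the set of
    rule indices that fired on that message. Hypothetical helper.
    """
    rng = random.Random(seed)
    scores = [0.0] * n_rules
    bias = 0.0
    data = list(messages)
    for _ in range(epochs):
        rng.shuffle(data)                      # visit messages in random order
        for hits, is_spam in data:
            z = bias + sum(scores[i] for i in hits)
            p = 1.0 / (1.0 + math.exp(-z))     # predicted spam probability
            err = p - (1.0 if is_spam else 0.0)
            bias -= lr * err                   # gradient step on the bias
            for i in hits:
                scores[i] -= lr * err          # gradient step on each fired rule
    return scores, bias

# Hypothetical rules: 0 = html_mail, 1 = spam_phrase, 2 = pgp_signed.
# Made-up corpus in which HTML mail only ever appears in spam.
corpus = [
    ({0}, True), ({0, 1}, True), ({1}, True),
    ({2}, False), (set(), False), ({2}, False),
]
scores, bias = train_scores(corpus, n_rules=3)
print([round(s, 2) for s in scores], round(bias, 2))
```

On a corpus like this one, where HTML never shows up in legitimate
mail, the HTML rule ends up with a positive score, which is the kind
of list-specific adjustment described above.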

Duncan Findlay
