
Re: Spamassassin, keep feeding messages for bayes?



on Sat, Nov 15, 2003 at 06:05:03AM -0800, Steve Lamb (grey@dmiyu.org) wrote:
> Karsten M. Self wrote:
> >SA has an "autolearn" feature, where mail scoring above 6, and below
> >0.1, will be "autolearned" as spam and ham.  That is, the Bayesian
> >classifier will train on these mails.
> 
>     However these only are what SA would have caught already without the 
> Bayesian score.  It discards that when autolearning to prevent a 
> self-spiralling corruption of its database.  On any given pass through d-u, 
> d-d, d-m or d-k I can get 50-60+% messages which were not learned by SA. 
> That's a large amount to discard.

This isn't my understanding.

Remember that the training happens on both sides of the scoring -- both
ham and spam are used to train.

For words which frequently appear in both classes of mail, the
predictive score will be low.  Terms appearing with greater exclusivity
in one or the other will have high absolute scores.

Over time, you'll have fewer words which aren't predictive one way or
the other, though some terms may not predict _much_.
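
A rough sketch of that idea in Python, with made-up counts.  This is a
simplified per-token estimate for illustration only, not SpamAssassin's
actual computation; the function names and the sample numbers are
hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token appears in each class."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


def predictiveness(p):
    """Distance from the neutral value 0.5, scaled to the range 0..1."""
    return abs(p - 0.5) * 2


# Hypothetical counts: (messages containing the token in spam, in ham),
# out of 1000 trained spam and 1000 trained ham messages.
counts = {
    "the": (950, 960),     # common in both classes
    "debian": (5, 400),    # almost exclusive to ham
    "viagra": (300, 1),    # almost exclusive to spam
}
for token, (s, h) in counts.items():
    p = token_spam_probability(s, h, 1000, 1000)
    print(f"{token:8s} p={p:.3f} predictiveness={predictiveness(p):.3f}")
```

A token seen about equally in both classes lands near 0.5 and
contributes little; a token seen almost only in one class lands near 0
or 1.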

> Does he need to feed every message to SA?  If he has autolearning
> turned on, no.  Should he feed samples in regularly?  Yes.

I'm not quite sure what you're saying here.

My sense:  the autolearning does training for you.  Explicitly training
on false positives/negatives corrects for terms that were misclassified
or not properly scored.  That should further improve accuracy, and be
more-or-less sufficient.

FWIW:  scoring of a particular item of mail will change over time.  I'll
occasionally come across misclassified spam in a folder (particularly
one I don't read regularly), check its spam score as noted in headers
(below threshold), and then run 'spamc -c' to check the current score.
Often it's now _over_ threshold.  I attribute this to either automated
or manual training of the Bayesian classifier.  The differences are
sometimes very marked -- headers note score of 2-3, spamc returns 8-16.
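
That re-check can be scripted.  A minimal sketch in Python, assuming
spamd is running and that 'spamc -c' prints "score/threshold" on
stdout; the helper names and the message path passed to recheck() are
hypothetical.

```python
import subprocess


def parse_spamc_check(output):
    """Parse 'score/threshold' text, e.g. '8.5/5.0', into two floats."""
    score_text, threshold_text = output.strip().split("/", 1)
    return float(score_text), float(threshold_text)


def recheck(path):
    """Run 'spamc -c' on a stored message and return (score, threshold)."""
    with open(path, "rb") as message:
        # spamc -c exits non-zero when the message is spam, so the exit
        # status is deliberately not treated as an error here.
        result = subprocess.run(
            ["spamc", "-c"], stdin=message, stdout=subprocess.PIPE
        )
    return parse_spamc_check(result.stdout.decode())
```

Comparing the returned score with the one recorded in the message's
headers shows how far the scoring has moved since delivery.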


Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    They that can give up essential liberty to obtain a little temporary
    safety deserve neither liberty nor safety.
    - Benjamin Franklin, 1759
