Re: Spamassassin, keep feeding messages for bayes?

To: debian-user@lists.debian.org
Subject: Re: Spamassassin, keep feeding messages for bayes?
From: Steve Lamb <grey@dmiyu.org>
Date: Tue, 18 Nov 2003 09:45:51 -0800
Message-id: <3FBA5ACF.8030402@dmiyu.org>
In-reply-to: <20031118110003.GC12945@ix.netcom.com>
References: <20031115071208.GC28593@ix.netcom.com> <3FB6328F.3070908@dmiyu.org> <20031118110003.GC12945@ix.netcom.com>

Karsten M. Self wrote:

on Sat, Nov 15, 2003 at 06:05:03AM -0800, Steve Lamb (grey@dmiyu.org) wrote:

Karsten M. Self wrote:

SA has an "autolearn" feature, where mail scoring above 6, and below
0.1, will be "autolearned" as spam and ham.  That is, the Bayesian
classifier will train on these mails.

However, these are only the messages SA would have caught already without the Bayesian score. It discards that score when autolearning to prevent a self-spiralling corruption of its database. On any given pass through d-u, d-d, d-m or d-k I can get 50-60+% of messages which were not learned by SA. That's a large amount to discard.

This isn't my understanding.


    From man Mail::SpamAssassin::Conf...

       bayes_auto_learn ( 0 | 1 )      (default: 1)
           Whether SpamAssassin should automatically feed high-scoring mails
           (or low-scoring mails, for non-spam) into its learning systems.
           The only learning system supported currently is a naive-Bayesian-
           style classifier.

           Note that certain tests are ignored when determining whether a mes-
           sage should be trained upon:
            - auto-whitelist (AWL)
            - rules with tflags set to 'learn' (the Bayesian rules)
            - rules with tflags set to 'userconf' (user white/black-listing
           rules, etc)

Remember that the training happens on both sides of the scoring -- both
ham and spam are used to train.

I know, but when I read the above I tried to figure out why they would ignore the Bayesian score when determining whether or not to autolearn something as spam or ham. The only thing I could figure was that the Bayesian score can shift spam or ham into that neutral territory (0.2 to 11.99 as of 2.6; it was wider in 2.5x) where it would not be learned. If enough of that happened one would be learning only spam or only ham and the autolearning would be thrown off. Remember that the Bayesian filtering can shift a message +/-5 depending on other scoring. That is enough of a shift to defeat the score of a lot of ham messages (the -3 to -4 area).
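To make the autolearn decision concrete, here is a rough Python sketch of how I understand it. This is not SpamAssassin's code: the tuple layout, the function names and the example rule scores are made up for illustration, and the thresholds are simply the 2.6 figures mentioned above.

```python
# Illustrative sketch only, not SpamAssassin's implementation.
# Thresholds are assumed from the 2.6 figures discussed in this thread.
HAM_THRESHOLD = 0.1    # below this (Bayes excluded) the mail is learned as ham
SPAM_THRESHOLD = 12.0  # at or above this (Bayes excluded) it is learned as spam

# Kinds of rules the man page says are ignored for the autolearn decision.
IGNORED_KINDS = {"awl", "learn", "userconf"}


def autolearn_score(hits):
    """Sum rule scores, skipping AWL, Bayes ('learn') and userconf rules.

    `hits` is a list of (rule_name, kind, score) tuples; this layout is
    hypothetical and exists only for the example.
    """
    return sum(score for _name, kind, score in hits if kind not in IGNORED_KINDS)


def autolearn_decision(hits):
    """Return 'spam', 'ham' or None (neutral zone, nothing is learned)."""
    score = autolearn_score(hits)
    if score >= SPAM_THRESHOLD:
        return "spam"
    if score < HAM_THRESHOLD:
        return "ham"
    return None


# A personal mail: slightly suspicious to the static rules, while Bayes is
# sure it is ham. The rule scores here are invented for the example.
personal = [
    ("HTML_MESSAGE", "normal", 0.5),
    ("BAYES_00", "learn", -5.4),
]

# The Bayes hit is ignored, so the autolearn score is 0.5: neutral zone.
print(autolearn_decision(personal))
```

The point of the sketch is that the Bayes hit never enters the decision, so a message Bayes is certain about can still sit in the neutral zone and go unlearned.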

As such, Spamassassin's autolearning is only learning from messages that Spamassassin would, by its own internal rules, be catching as spam or ham anyway. It is up to the individual to feed the Bayesian classifier the messages which SA normally would not catch one way or another so that it may learn. This should be done even if the Bayesian classifier correctly identified the message as either spam or ham, because if SA did not see it as one or the other it won't get learned. Most of the messages to my parents, to a personal friend who is also the administrator of my secondary MX, and to my fiancee are not caught or trained on by SA, yet they represent a good portion of the mail I think should be trained upon.
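For the manual feeding, something along these lines would do. It is a minimal sketch that only builds the sa-learn command lines for a set of mbox folders; the folder names and the helper name are hypothetical, and I am assuming the folders are in mbox format so that --mbox applies.

```python
import shlex


def sa_learn_command(kind, mbox_path):
    """Build an sa-learn command line for one mbox folder.

    `kind` must be 'spam' or 'ham'. Nothing is executed here.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]


# Hypothetical folders: mail SA never autolearns from, sorted by hand.
folders = {
    "Mail/family": "ham",
    "Mail/missed-spam": "spam",
}

for path, kind in folders.items():
    cmd = sa_learn_command(kind, path)
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually train, pass each list to subprocess.run(cmd, check=True), say from a nightly cron job.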


Does he need to feed every message to SA?  If he has autolearning turned on, no.  Should he feed samples in regularly?  Yes.

I'm not quite sure what you're saying here.

My sense:  the autolearning does training for you.  Explicitly training
on false positives/negatives corrects for misclassified terms or those
not properly scored.  That should improve further accuracy, and be
more-or-less sufficient.

True. However, I did read somewhere that an informal study (can't remember if it was Spambayes, Spamassassin or something else, and Google is failing me now) showed that mistake-based training is not as efficient as just periodically training the classifier with a corpus from either side of the equation, even if it got it correct in most cases. SA's autotraining does meet this, but the end result, without the user adding his own, is that the spam autotrained by SA is only what SA would have caught anyway. It makes the Bayesian portion something of a duplication of effort, really. My interest in the Bayesian portion is that it be able to help SA catch what SA would not normally catch in the first place. To do that I have to feed it examples of both ham and spam that SA does not catch but the Bayesian classifier might have caught. This would increase the accuracy of the classifier and swing the probability more in one direction or the other, and that in turn will raise or lower the score contributed by the BAYES rule set.

FWIW:  scoring of a particular item of mail will change over time.  I'll
occasionally come across mis-classified spam in a folder (particularly
one I don't read regularly), check its spam score as noted in headers
(below threshold), and then run 'spamc -c' to check the current score.
Often it's now _over_ threshold.  I attribute this to either automated
or manual training of the Bayesian classifier.  The differences are
sometimes very marked -- headers note score of 2-3, spamc returns 8-16.
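The comparison described here could be scripted roughly as follows. This is only a sketch: it assumes the X-Spam-Status header carries a "hits=" (or, in other versions, "score=") field and that 'spamc -c' prints a single "score/threshold" line; the function names are made up.

```python
import re


def header_score(status_header):
    """Pull the score out of an X-Spam-Status header value.

    Assumes a 'hits=N' or 'score=N' field; returns None if absent.
    """
    match = re.search(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)", status_header)
    return float(match.group(1)) if match else None


def spamc_check_score(output):
    """Parse the 'score/threshold' line assumed to come from 'spamc -c'."""
    score, threshold = output.strip().split("/")
    return float(score), float(threshold)


# Example values only, in the range mentioned above.
old = header_score("No, hits=2.7 required=5.0 tests=SOME_RULE")
new, threshold = spamc_check_score("8.2/5.0\n")
print(old, new, new >= threshold)
```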

The smaller swing (3 to 8) is possible without network checks turned on, since a neutral message gets a score of 0 from the Bayesian classifier while a 99%+ message gets +5. The larger swings are very unlikely: the message would have to go from a 0-1% probability (-5.4 if memory serves) to a 99%+ probability (+5), and you would also have to be doing no network checks. It takes a lot to swing from one direction to the other.
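The arithmetic, spelled out as a quick Python sketch. The Bayes figures are the ones I quoted from memory, so treat them as assumptions rather than the real score set, and the helper name is made up.

```python
# Bayes rule contributions as recalled in this post (assumed, from memory).
BAYES_MIN = -5.4      # 0-1% spam probability
BAYES_NEUTRAL = 0.0   # neutral message
BAYES_MAX = 5.0       # 99%+ spam probability


def bayes_alone_explains(old_score, new_score, bayes_before=BAYES_NEUTRAL):
    """True if retraining Bayes alone could account for the score change.

    Assumes every other rule scored the same on both runs (for example,
    no network checks in either case).
    """
    return (new_score - old_score) <= (BAYES_MAX - bayes_before)


# Neutral -> 99%+: a swing of 5, which covers going from 3 to 8.
print(bayes_alone_explains(3, 8))
# Going from 2 to 16 is a swing of 14; even the full -5.4 -> +5 range
# (10.4 points) cannot cover it.
print(bayes_alone_explains(2, 16, bayes_before=BAYES_MIN))
```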

BTW, in composing this I did find confirmation of my suspicions on why SA doesn't count the BAYES scores when deciding whether or not to autolearn:


    From man sa-learn...

       2. Unsupervised learning from Bayesian classification
           Another way to train is to chain the results of the Bayesian clas-
           sifier back into the training, so it reinforces its own decisions.
           This is only safe if you then retrain it based on any errors you
           discover.

           SpamAssassin does not support this method, due to experimental
           results which strongly indicate that it does not work well, and
           since Bayes is only one part of the resulting score presented to
           the user (while Bayes may have made the wrong decision about a
           mail, it may have been overridden by another system).


--
         Steve C. Lamb         | I'm your priest, I'm your shrink, I'm your
       PGP Key: 8B6E99C5       | main connection to the switchboard of souls.
-------------------------------+---------------------------------------------
