
Re: Spamassassin, keep feeding messages for bayes?



Karsten M. Self wrote:
on Sat, Nov 15, 2003 at 06:05:03AM -0800, Steve Lamb (grey@dmiyu.org) wrote:
Karsten M. Self wrote:

SA has an "autolearn" feature, where mail scoring above 6, and below
0.1, will be "autolearned" as spam and ham.  That is, the Bayesian
classifier will train on these mails.

However, these are only the messages SA would have caught already without the Bayesian score. It discards that score when autolearning to prevent a self-reinforcing corruption of its database. On any given pass through d-u, d-d, d-m or d-k, 50-60% or more of the messages were not learned by SA. That's a large amount to discard.

This isn't my understanding.

    From man Mail::SpamAssassin::Conf...

       bayes_auto_learn ( 0 | 1 )      (default: 1)
           Whether SpamAssassin should automatically feed high-scoring mails
           (or low-scoring mails, for non-spam) into its learning systems.
           The only learning system supported currently is a naive-Bayesian-
           style classifier.

           Note that certain tests are ignored when determining whether a mes-
           sage should be trained upon:
            - auto-whitelist (AWL)
            - rules with tflags set to 'learn' (the Bayesian rules)
            - rules with tflags set to 'userconf' (user white/black-listing
           rules, etc)

Remember that the training happens on both sides of the scoring -- both
ham and spam are used to train.

I know, but when I read the above I tried to figure out why they would ignore the Bayesian score when determining whether or not to autolearn something as spam or ham. The only thing I could figure was that the Bayesian score can shift spam or ham into that neutral territory (0.2 to 11.99 as of 2.6; it was wider in 2.5x) where it would not be learned. If enough of that happened, one would be learning only spam or only ham, and the autolearning would be thrown off. Remember that the Bayesian filtering can shift a message +/-5 depending on other scoring. That is enough of a shift to cancel out the scores of a lot of ham messages (the -3 to -4 area).

As such, SpamAssassin's autolearning is only learning from messages that SpamAssassin would, by its own internal rules, be catching as spam or ham anyway. It is up to the individual to feed the Bayesian classifier the messages which SA normally would not catch one way or another, so that it may learn. This should be done even if the Bayesian classifier correctly identified the message as either spam or ham, because if SA did not see it as one or the other it won't get learned. Most of the messages to my parents, to a personal friend who is also the administrator of my secondary MX, and to my fiancee are neither caught nor trained on by SA, yet they represent a good portion of the mail I think should be trained upon.
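To make the point above concrete, here is a minimal sketch of the decision as I understand it from the man page excerpt. It is not SA's actual code. The thresholds are assumptions taken from the neutral range mentioned in this thread, and the rule names other than BAYES_99 and AWL are hypothetical.

```python
# Sketch only: the autolearn decision as described in this thread.
# Thresholds are assumed from the "0.2 to 11.99" neutral range above.
HAM_THRESHOLD = 0.1    # assumed: learn as ham below this
SPAM_THRESHOLD = 12.0  # assumed: learn as spam at or above this

# tflags whose rules are ignored when deciding (per the man page excerpt)
IGNORED_TFLAGS = {"learn", "userconf"}


def autolearn_decision(hits):
    """hits: list of (rule_name, score, tflags) tuples for one message.

    Returns "spam", "ham", or None when the message is not learned.
    """
    # Sum only the rules that count: skip AWL and the ignored tflags.
    score = sum(
        points
        for name, points, tflags in hits
        if name != "AWL" and not (set(tflags) & IGNORED_TFLAGS)
    )
    if score >= SPAM_THRESHOLD:
        return "spam"
    if score < HAM_THRESHOLD:
        return "ham"
    return None  # neutral territory: left for manual training


# Bayes pushes the total to 6.6, but only 1.6 counts toward autolearning.
caught_by_bayes_only = [
    ("BAYES_99", 5.0, ["learn"]),
    ("HYPOTHETICAL_RULE_A", 0.1, []),
    ("HYPOTHETICAL_RULE_B", 1.5, []),
]
print(autolearn_decision(caught_by_bayes_only))  # None
```

A message like the one in the example is exactly the kind that has to be fed in by hand: the total marks it as spam, yet the autolearner never sees it.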

Does he need to feed every message to SA?  If he has autolearning
turned on, no.  Should he feed samples in regularly?  Yes.

I'm not quite sure what you're saying here.

My sense:  the autolearning does training for you.  Explicitly training
on false positives/negatives corrects for misclassified terms or those
not properly scored.  That should further improve accuracy, and be
more-or-less sufficient.

True. However, I did read somewhere that an informal study (can't remember if it was Spambayes, Spamassassin or something else, and Google is failing me now) showed that mistake-based training is not as efficient as just periodically training the classifier with a corpus from either side of the equation, even if it got it correct in most cases. SA's autotraining does meet this, but the end result, without the user adding his own, is that the spam autotrained by SA is only what SA would have caught anyway. It makes the Bayesian portion something of a duplication of effort, really.

My interest in the Bayesian portion is that it be able to help SA catch what SA would not normally catch in the first place. To do that I have to feed it examples of both ham and spam that SA does not catch but the Bayesian classifier might have caught. This would increase the accuracy of the classifier and swing the probability more in one direction or the other, which in turn will raise or lower the scores from the BAYES set.

FWIW:  scoring of a particular item of mail will change over time.  I'll
occasionally come across mis-classified spam in a folder (particularly
one I don't read regularly), check its spam score as noted in headers
(below threshold), and then run 'spamc -c' to check the current score.
Often it's now _over_ threshold.  I attribute this to either automated
or manual training of the Bayesian classifier.  The differences are
sometimes very marked -- headers note score of 2-3, spamc returns 8-16.

The inner range (3 to 8) is possible without network checks on, since a neutral message gets a 0 score from the Bayesian classifier while a 99%+ message gets +5. The higher swings are very unlikely: a message would have to go from a 0-1% probability (-5.4, if memory serves) to a 99%+ probability (+5), with no network checks involved. That takes a lot to swing from one direction to the other.
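The arithmetic behind that, as a quick sketch. The Bayes endpoints are the figures quoted from memory above, so treat them as assumptions rather than SA's actual score table, and the function name is just for illustration.

```python
# Sketch only: how far a change in the Bayes contribution alone can move
# a message's total. Endpoint values are the from-memory figures above.
BAYES_MIN = -5.4     # assumed contribution at 0-1% spam probability
BAYES_NEUTRAL = 0.0  # assumed contribution for a neutral message
BAYES_MAX = 5.0      # assumed contribution at 99%+ spam probability


def rescored(header_score, old_bayes, new_bayes):
    """Total after swapping the old Bayes contribution for the new one."""
    return header_score - old_bayes + new_bayes


# Neutral -> 99%+ : the "inner" case, 3 becomes 8.
inner = rescored(3.0, BAYES_NEUTRAL, BAYES_MAX)

# 0-1% -> 99%+ : the largest swing Bayes alone could produce.
largest = rescored(3.0, BAYES_MIN, BAYES_MAX)

print(inner)              # 8.0
print(round(largest, 1))  # 13.4
```

Under these assumed figures, a header score of 3 cannot reach 16 on Bayes retraining alone; anything past about 13.4 would need some other rule to have changed as well.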

BTW, in composing this I did find confirmation of my suspicions on why SA doesn't count the BAYES scores when deciding whether or not to autolearn:

    From man sa-learn...

       2. Unsupervised learning from Bayesian classification
           Another way to train is to chain the results of the Bayesian clas-
           sifier back into the training, so it reinforces its own decisions.
           This is only safe if you then retrain it based on any errors you
           discover.

           SpamAssassin does not support this method, due to experimental
           results which strongly indicate that it does not work well, and
           since Bayes is only one part of the resulting score presented to
           the user (while Bayes may have made the wrong decision about a
           mail, it may have been overridden by another system).


--
         Steve C. Lamb         | I'm your priest, I'm your shrink, I'm your
       PGP Key: 8B6E99C5       | main connection to the switchboard of souls.
-------------------------------+---------------------------------------------
