Re: bayesian filter training question
On Thursday 29 September 2005, 21:51, Roberto C. Sanchez wrote:
> So, I finally decided to get with the 20th century and install
> spamassassin (actually spampd hooked through postfix) to do site-wide
> spam filtering for my server.
> My question is this. As I am training
> it with sa-learn, is it (good|bad|indifferent) to train it on spam
> that has already been flagged as spam? That is, will this reinforce
> spamassassin's notion of spam or ruin it?
No, that's fine. In fact, SA has this autowhitelist concept that does
exactly that (it's not really a whitelist, though, more a way of
"evening out weird things that may happen"; I'm not using it).
You should have a good look at bayes_ignore_header, so that Bayes won't
train on headers that are obviously only ever in spam. SA is pretty
good at this itself, but if you often see spam that has already been
filtered (and tagged) elsewhere, be sure to use it.
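
For instance, if an upstream filter stamps its verdict into headers
before the mail reaches SA, a local.cf fragment along these lines should
keep the Bayes tokenizer away from them. The header names here are
made-up placeholders, not something from your setup; substitute whatever
your upstream filter actually adds.

```
# local.cf (sketch; the header names below are hypothetical placeholders)
bayes_ignore_header X-Upstream-Spamfilter
bayes_ignore_header X-Upstream-Spam-Score
```
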
I'm guessing that you, like me, are doing this for your family. In that
case, I have found it quite sufficient to train a single database on
the spam and ham of the entire family. If your users are more diverse,
you will probably need a per-user configuration. For example, a friend
of mine has an uncle who is a psychiatrist working with people with
gambling obsessions, and SA was pretty catastrophic for him until he
got a per-user config.
Finally, I found that SA, in its default 3.0 form, was much too
conservative about the scores it assigns, so I have adjusted the scores
of a bunch of rules. You'll get some experience with that in time, I
guess. Also note that SA 3.1 has been released upstream.
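
As a rough sketch of what such an adjustment looks like in local.cf:
BAYES_99 and BAYES_95 are stock rule names, but the numbers are purely
illustrative assumptions, not recommendations. Look at which rules
actually hit on your own mail before changing anything.

```
# local.cf (sketch; score values are illustrative assumptions only)
score BAYES_99 4.5
score BAYES_95 3.5
```

Running "spamassassin --lint" afterwards checks the configuration for
syntax errors; spampd then needs a restart to pick up the change.
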
Programmer / Astrophysicist / Ski-orienteer / Orienteer / Mountaineer
Homepage: http://www.kjetil.kjernsmo.net/ OpenPGP KeyID: 6A6A0BBC