Re: Bayesian filtering (was: Re: Massive increase of spam on debian-*@l.d.o)
On Wed, 5 May 2004 13:12:51 -0000
"Monique Y. Mudama" <spam@bounceswoosh.org> wrote:
> Anyway, I dutifully pipe them through sa-learn, but I worry. If these
> spams look so much like regular mail, won't I just end up tainting my
> Bayesian library by teaching sa-learn with them? I mean, eventually,
> won't my Bayesian scheme be unable to distinguish between spam and ham?
>
> Thoughts?
If it looks at the headers as well as the body, as Bogofilter does, that
should help it to distinguish. Also, what you define as ham is surely
more than just well-formed grammar etc. Your corpus of ham messages
surely contains either a different collection of words, or the same
words with a different frequency of occurrence, than spam messages do,
and if you train it right, a good Bayesian system should be able to see
the difference. I should have thought you would only have problems if
your ham normally contains a lot of long-winded jokes similar to the
spam, and the spam comes from sources that your ham normally comes from.
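
To make the word-frequency point concrete, here is a rough sketch of a
naive Bayes scorer. It is only an illustration, not how sa-learn or
Bogofilter work internally; the TinyBayes class, the tokenize helper and
the add-one smoothing are all my own assumptions. Header lines are
treated as just more tokens, which is why they help.

```python
import math
from collections import Counter


def tokenize(message):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    # Header text ("from:", addresses) simply becomes extra tokens.
    return message.lower().split()


class TinyBayes:
    """Toy naive Bayes filter (illustrative only, not a real tool's API)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, message, label):
        # Count each token once per message (presence, not repetition).
        self.counts[label].update(set(tokenize(message)))
        self.messages[label] += 1

    def spam_probability(self, message):
        # Start from the smoothed prior odds of spam versus ham.
        log_odds = math.log(
            (self.messages["spam"] + 1) / (self.messages["ham"] + 1)
        )
        for token in set(tokenize(message)):
            # Add-one smoothing so unseen tokens never give zero.
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Convert log-odds back to a probability.
        return 1 / (1 + math.exp(-log_odds))
```

Trained on a couple of ham messages about kernels and a couple of spam
messages full of joke text, a filter like this scores new joke-heavy
mail as spam and kernel talk as ham, because the two corpora use
different words. It would only struggle if both corpora shared the same
vocabulary and the same senders.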
- Richard
--
Richard Kimber
http://www.psr.keele.ac.uk/