Re: SpamAssassin now used to filter BTS Mail
On Thu, 17 Oct 2002, Duncan Findlay wrote:
> The scores produced for SpamAssassin are determined based on a corpus
> of spam and nonspam provided by volunteers. Although it may have a
> slight technical bias, we try to include as much commercial non-spam
> and legitimate mailing lists as possible.
> If we were able to base scores solely on the kind of mail we receive
> for the BTS, SpamAssassin would filter more effectively. Think about
> it as optimising SpamAssassin for a specific type of mail.
> I would estimate that customised (evolved) scores would cut the
> false-negatives at least by half, and the false-positives even more.
> The problem involves the creation of the corpuses, on which the scores
> must be based. Any spam in a non-spam corpus (or vice versa) would
> have a huge impact. The corpuses don't have to be _too_ large. The
> corpuses used for the default scores are 33k spam, 170k non-spam, but
> we'd probably get decent results with a total of about 20k (with the
> split about equal to the split of mail received by the BTS).
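
As a rough illustration of the sizing arithmetic quoted above, here is a
minimal Python sketch that divides a target corpus size in proportion to
observed spam and non-spam counts. The function name is hypothetical. The
actual BTS split is not stated in this thread, so the example uses the
default-corpus figures (33k spam, 170k non-spam) purely as stand-in input.

```python
def proportional_split(total, spam_seen, ham_seen):
    """Divide `total` messages between spam and non-spam in the same
    ratio as the observed counts. Returns (spam_target, ham_target)."""
    seen = spam_seen + ham_seen
    if seen <= 0:
        raise ValueError("need at least one observed message")
    spam_target = round(total * spam_seen / seen)
    # Give the remainder to non-spam so the two targets always sum to total.
    return spam_target, total - spam_target


# Stand-in ratio only: the default-corpus counts, not real BTS figures.
spam_target, ham_target = proportional_split(20000, 33000, 170000)
print(spam_target, ham_target)  # 3251 16749
```

With real BTS counts substituted for the stand-in numbers, the same call
would give the per-class targets for a 20k corpus.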
Well, since the start of this spam scanning for the BTS, there hasn't been
much mail. In fact, since I started sending it all through procmail (Sep 26),
there have only been 19258 mails.
However, we can use the entire BTS as a corpus. It's possible to extract
*ALL* mail that has been sent to the BTS in the past, feed that through SA,
and then fine-tune the output for real spam.
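
A minimal sketch of the sorting step, assuming the extracted mail has
already been run through SA and saved as a single mbox file (the file
layout and the function name are assumptions, not how the BTS stores
mail). It relies only on the X-Spam-Status header that SpamAssassin adds
to scanned mail, and leaves the manual review of each pile to a human.

```python
import mailbox


def partition_by_spam_status(mbox_path):
    """Sort messages from an mbox into (spam, ham, unscanned) lists,
    based on the X-Spam-Status header added by SpamAssassin."""
    spam, ham, unscanned = [], [], []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            status = msg.get("X-Spam-Status")
            if status is None:
                # No header: this message was never scanned.
                unscanned.append(msg)
            elif status.strip().lower().startswith("yes"):
                spam.append(msg)
            else:
                ham.append(msg)
    finally:
        box.close()
    return spam, ham, unscanned
```

The two scanned piles are only a first cut. Since any misfiled message
in either corpus would skew the scores, both would still need checking
by hand before being used.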
That has been hampered lately, though, as I have told people to mail
owner@bugs to remove spam from their reports. :(