
RE: Promoting your website with bulk-email



Hello,

On those points, I would say that feeding a Bayesian filter too many samples weakens it. Training on roughly the same amount of each class also helps a lot: my "home-made" Bayesian filter runs with 10k good mails and 8k spam mails and gets up to 85% positives and 15% false negatives.

I agree that Bayesian filters alone are not enough, but in my case Bayesian filters combined with regular expressions are enough: if one fails, the other matches.
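To make the "if one fails, the other matches" idea concrete, here is a minimal sketch in Python. It is not my actual filter: the class name TinyBayes, the regex patterns, the 0.9 threshold and the tiny training set are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; a real filter would also look at headers.
    return re.findall(r"[a-z0-9$']+", text.lower())


class TinyBayes:
    """Hypothetical, minimal naive Bayes scorer (illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, text):
        # Count each token once per message (presence, not frequency).
        self.docs[label] += 1
        self.counts[label].update(set(tokenize(text)))

    def spam_probability(self, text):
        # Sum log-odds per token, with add-one smoothing.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in set(tokenize(text)):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical hand-written rules; these patterns are assumptions.
RULES = [
    re.compile(p, re.IGNORECASE)
    for p in (r"bulk[- ]?e-?mail", r"100% free", r"click here now")
]


def is_spam(model, text, threshold=0.9):
    # Flag the message if either the Bayesian score or any regex fires.
    if model.spam_probability(text) >= threshold:
        return True
    return any(rule.search(text) for rule in RULES)


# Tiny demo corpus (made up); real training uses thousands of mails.
model = TinyBayes()
model.train("spam", "Promote your website with bulk email today")
model.train("spam", "cheap pills buy now")
model.train("ham", "meeting notes attached for the debian release")
model.train("ham", "lunch tomorrow at noon")

print(is_spam(model, "buy cheap pills now"))
print(is_spam(model, "Totally new wording, click here now"))
print(is_spam(model, "lunch notes for tomorrow"))
```

The second demo message uses wording the Bayesian part has barely seen, so only the regex rule flags it; that is the case where the two approaches complement each other.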

Best regards,
Emmanuel.
 

-----Original Message-----
From: KELEMEN Peter [mailto:fuji@debian.org] 
Sent: Wednesday, July 02, 2003 4:58 PM
To: debian-curiosa@lists.debian.org
Cc: Emmanuel Ormancey

* Mark Brown (broonie@sirena.org.uk) [20021029 18:35]:

> Another possibility: in my experience bogofilter seems to work
> better when it has seen very much more non-spam than spam
> e-mail.  As I recall your data set was about evenly split
> between the two.

Well, I just can't get enough ham. :-) Recently I did the same
test again with 50k spam and 15k ham.  Both SpamAssassin and
bogofilter were trained with the full spam corpus and full
ham corpus, then run against a set of 1387 previously unseen,
human-verified spam messages.  Results are:
 
bogofilter: 306 (22%) positives, 1081 (78%) false negatives
spamassassin: 580 (42%) positives, 807 (58%) false negatives

After training bogofilter with an additional 30k of ham, the result
improved somewhat:

bogofilter: 631 (45%) positives, 756 (55%) false negatives
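
As a quick check of the arithmetic, the percentages can be recomputed from the raw counts and the 1387-message test set. This is only a sketch; the helper name rates and the dictionary labels are made up for illustration.

```python
TOTAL = 1387  # previously unseen, human-verified spam messages


def rates(caught, total=TOTAL):
    """Return (catch rate, false-negative rate) as fractions of the test set."""
    missed = total - caught
    return caught / total, missed / total


# Number of spam messages each run flagged, taken from the results above.
runs = {
    "bogofilter": 306,
    "spamassassin": 580,
    "bogofilter (+30k ham)": 631,
}

for name, caught in runs.items():
    hit, miss = rates(caught)
    print(f"{name}: {caught} ({hit:.0%}) positives, "
          f"{TOTAL - caught} ({miss:.0%}) false negatives")
```

Since every message in the test set is spam, the two rates for each run always add up to 100%.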

Training SpamAssassin with the same 30k additional ham failed with
OOM on a 256M RAM P4 machine.

From the results, it is clear to me that Bayesian spam filtering
alone is still not good enough to catch most spam.  If time
permits, I'll look into CRM114 and others.

Peter

-- 
    .+'''+.         .+'''+.         .+'''+.         .+'''+.         .+''
 Kelemen Péter     /       \       /       \       /    fuji@debian.org
.+'         `+...+'         `+...+'         `+...+'         `+...+'



