Re: Promoting your website with bulk-email
>> Clint Adams <schizo@debian.org> writes:
> There have been lexer changes to better handle non-ASCII messages
> since then, although languages that string words together without
> whitespace still don't get tokenized well.
Yes, you are right on the spot. German is a little nightmare in this
respect (e.g., Erzeugerabfüllung -- and I'm not even trying hard). In
fact, the problems I have with bogofilter are not bogofilter problems
per se, but:
1. These languages (well, German and Spanish at least) are different
enough from English to require something a bit more intelligent.
For example, while English has the words (and let me use a very
naïve but nevertheless illustrative example) "see", "sees", "seen"
and "saw", Spanish has "veo", "ves", "ve", "vemos", "veis", "ven",
"visto", "vi", "viste", "vió", "vimos", "visteis", "vieron". The
number of forms for a given word is much larger than in English.
Since bogofilter stores just words in the database (and AFAIK, it
doesn't employ any distance-based algorithm), your corpus has to be
much larger than in the English case, which leads me to the other
problem:
2. The messages I can use for training are mostly in English, with a
small fraction in German and Spanish. That means that for the SPAM
in English, bogofilter does a passable job. For email in Spanish
and German, I get false positives more often because the training
hasn't been that good -- and I'd even dare say it has a bias towards
false positives.
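To put rough numbers on point 1, here is a minimal Python sketch. It does not use bogofilter or its database; it only counts exact tokens, which is the behaviour described above. The function names and the figure of 52 occurrences are made up for illustration, and spreading the occurrences evenly over the forms is a simplifying assumption.

```python
from collections import Counter

# Word forms taken from the example above.
ENGLISH_FORMS = ["see", "sees", "seen", "saw"]
SPANISH_FORMS = ["veo", "ves", "ve", "vemos", "veis", "ven", "visto",
                 "vi", "viste", "vió", "vimos", "visteis", "vieron"]


def token_counts(words):
    """Count exact tokens, as a purely word-based filter would."""
    return Counter(word.lower() for word in words)


def evidence_per_token(occurrences, forms):
    """Average count per token if the occurrences of one verb are
    spread evenly over its forms (a simplifying assumption)."""
    return occurrences / forms


english = token_counts(ENGLISH_FORMS)
spanish = token_counts(SPANISH_FORMS)

# Hypothetical: the verb shows up 52 times in each training corpus.
occurrences = 52
print(len(english), evidence_per_token(occurrences, len(english)))
print(len(spanish), evidence_per_token(occurrences, len(spanish)))
```

With the same number of occurrences, each Spanish token ends up with roughly a third of the evidence an English token gets, which is why the corpus has to be that much larger.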
I've been trying a new strategy for the last few days. I now have
SpamAssassin as tier one, since it does a superb job at catching the
real junk, and behind that I have bogofilter. That's been working quite
well. For example, SpamAssassin lets Nigerian-scam kind of stuff through
most of the time, since it doesn't score high enough. But ever since
telling bogofilter a couple of times that that's SPAM, it's been
catching it all right, and it leaves my other email alone. In fact, I
haven't got a single piece of SPAM in my INBOX for almost a week now,
and there haven't been any false positives either. I had to tweak
SpamAssassin's weights a bit because our local mail system is horribly
broken, though (that is, I had to disable certain header checks).
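The two-tier arrangement can be sketched roughly as follows, in Python. This is not how SpamAssassin or bogofilter are actually invoked; both tiers are stand-in functions, and the threshold, the keyword checks and all names are placeholders chosen for illustration.

```python
def classify(message, tier_one_score, tier_two_is_spam, threshold=5.0):
    """Run a rule-based scorer first, then a trained filter on whatever
    the first tier lets through. threshold is a placeholder value."""
    if tier_one_score(message) >= threshold:
        return "spam (tier one)"
    if tier_two_is_spam(message):
        return "spam (tier two)"
    return "ham"


# Stand-ins for the real filters (assumptions, not their real behaviour).
def fake_rule_score(message):
    return 6.0 if "viagra" in message.lower() else 1.0


def fake_trained_filter(message):
    return "urgent business proposal" in message.lower()


for text in ["Cheap VIAGRA here",
             "Urgent business proposal from Lagos",
             "Meeting moved to Thursday"]:
    print(text, "->", classify(text, fake_rule_score, fake_trained_filter))
```

The second message scores too low for the first tier but is caught by the second one, which is the Nigerian-scam case described above.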
--
Marcelo | This signature was automatically generated with
mmagallo@debian.org | Signify v1.07. For this and other cool products,
| check out http://www.debian.org/