
Re: Promoting your website with bulk-email



>> Clint Adams <schizo@debian.org> writes:

 > There have been lexer changes to better handle non-ASCII messages
 > since then, although languages that string words together without
 > whitespace still don't get tokenized well.

 Yes, you are spot on.  German is a little nightmare in this
 respect (e.g., Erzeugerabfüllung -- and I'm not even trying hard).  In
 fact, the problems I have with bogofilter are not bogofilter problems
 per se, but:

 1. These languages (well German and Spanish at least) are different
    enough from English to require something a bit more intelligent.
    For example, while English has the words (and let me use a very
    naïve but nevertheless illustrative example) "see", "sees", "seen"
    and "saw", Spanish has "veo", "ves", "ve", "vemos", "veis", "ven",
    "visto", "vi", "viste", "vió", "vimos", "visteis", "vieron".  The
    number of forms for a given word is much larger than in English.
    Since bogofilter stores just words in the database (and AFAIK, it
    doesn't employ any distance-based algorithm), your corpus has to be
    much larger than in the English case, which leads me to the other
    problem:

 2. The messages I can use for training are mostly in English, with a
    small fraction in German and Spanish.  That means that for SPAM in
    English, bogofilter does a passable job.  For email in Spanish and
    German, I get false positives more often because the training
    hasn't been that good -- and I'd even dare say it has a bias towards
    false positives.
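
 To make point 1 concrete, here is a toy sketch in Python.  It is not
 bogofilter's lexer, just a whitespace split over two made-up sample
 texts, and the helper name verb_counts is mine.  It only shows how the
 same amount of text spreads its evidence over more distinct tokens when
 the verb is inflected more heavily:

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption for illustration only.
    return text.lower().split()


def verb_counts(text, forms):
    # Count how often each listed verb form occurs in the text,
    # keeping only the forms that actually appear.
    counts = Counter(tokenize(text))
    return {form: counts[form] for form in forms if counts[form]}


# Four short sentences each, invented for this example.
english = "i see it . she sees it . we see it . they see it"
spanish = "yo veo eso . ella ve eso . nosotros vemos eso . ellos ven eso"

en = verb_counts(english, ["see", "sees", "seen", "saw"])
es = verb_counts(spanish, ["veo", "ves", "ve", "vemos", "veis", "ven"])

print(en)  # two distinct forms, one of them seen three times
print(es)  # four distinct forms, each seen once
```

 With whole-word tokens, each Spanish form has been seen only once, so
 a per-word statistic has less to go on than for the English "see".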

 I've been trying a new strategy for the last few days.  I now have
 SpamAssassin as tier one, since it does a superb job at catching the
 real junk and behind that I have bogofilter.  That's been working quite
 well.  For example, SpamAssassin lets Nigerian-scam-type messages through
 most of the time, since it doesn't score high enough.  But ever since
 telling bogofilter a couple of times that that's SPAM, it's been
 catching it all right, and it leaves my other email alone.  In fact, I
 haven't got a single piece of SPAM in my INBOX for almost a week now
 and there haven't been any false positives either.  I had to tweak
 SpamAssassin's weights a bit because our local mail system is horribly
 broken, though (that is, I had to disable certain header checks).
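
 The two-tier arrangement can be sketched roughly like this.  This is
 plain Python with hypothetical stand-ins: rule_filter, LearnedFilter,
 the scores and the thresholds are all invented for illustration and do
 not call or imitate the real SpamAssassin or bogofilter interfaces.
 The only point is the ordering: a rule-based tier first, and a trained
 tier behind it for whatever scores too low.

```python
THRESHOLD = 5.0  # assumed cutoff for the rule-based tier


def rule_filter(message):
    # Hypothetical fixed rules with made-up scores.
    text = message.lower()
    score = 0.0
    if "viagra" in text:
        score += 6.0
    if "urgent" in text:
        score += 2.0
    return score >= THRESHOLD


class LearnedFilter:
    # Toy trained tier: remembers tokens from messages marked as spam.
    def __init__(self):
        self.spam_tokens = set()

    def train_spam(self, message):
        self.spam_tokens.update(message.lower().split())

    def is_spam(self, message):
        tokens = message.lower().split()
        hits = sum(token in self.spam_tokens for token in tokens)
        # Assumed rule: spam if at least half the tokens were seen in spam.
        return bool(tokens) and hits / len(tokens) >= 0.5


def two_tier(message, learned):
    if rule_filter(message):
        return "spam (tier one)"
    if learned.is_spam(message):
        return "spam (tier two)"
    return "inbox"


learned = LearnedFilter()
scam = "urgent business proposal from lagos bank transfer"
before = two_tier(scam, learned)           # scores too low for tier one
learned.train_spam("urgent transfer of funds from lagos bank")
after = two_tier(scam, learned)            # now caught by the second tier
junk = two_tier("cheap viagra now", learned)
ham = two_tier("meeting notes from the debian release team", learned)
print(before, after, junk, ham, sep=" | ")
```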

-- 
Marcelo             | This signature was automatically generated with
mmagallo@debian.org | Signify v1.07.  For this and other cool products,
                    | check out http://www.debian.org/


