Re: Attempts to poison bayesian systems

To: debian-security@lists.debian.org
Subject: Re: Attempts to poison bayesian systems
From: "Karsten M. Self" <kmself@ix.netcom.com>
Date: Sun, 28 Dec 2003 15:12:21 -0800
Message-id: <20031228231221.GI3556@ix.netcom.com>
Mail-followup-to: debian-security@lists.debian.org
In-reply-to: <20031223132530.GA9089@vnl.com>
References: <20031223132530.GA9089@vnl.com>

on Tue, Dec 23, 2003 at 01:25:30PM +0000, Dale Amon (amon@vnl.com) wrote:
> I've been noticing loads of mails like this lately:
> 
>   Date: Sun, 21 Dec 2003 16:25:34 +0500
>   From: "Joseph Jenkins" <qyzeji@canada.com>
>   Subject: Re: MIT, rest in peace!
>   To: admin@vnl.com
>   X-Mailer: mPOP Web-Mail 2.19
> 
>   emery atrocious larval drippy elate incontrollable raster anglicanism
>   checkerberry feed sit ajar saturable decathlon
>   already climate inhibition pagoda narcissus expository toni
> 
> I can only assume someone out there is trying to attack bayesian
> systems by loading them up with all sorts of normal words so that good
> mail gets false positives, thus breaking the systems.

This sort of attack on Bayesian filters is likely to be weakly
effective at best.

See Paul Graham's commentary on this:

    So Far, So Good
    August, 2003
    http://www.paulgraham.com/sofar.html

Spammers can attempt to bypass Bayesian filters by using fewer bad
tokens, or more good tokens (as Dale notes).  That's it.

Seeding content with more neutral tokens tends to make the body more,
well, neutral.  Unless specifically non-spammy tokens are used, there's
little net effect.  Unseen words get a near-neutral weighting in
Graham's work (0.4, which leans slightly toward non-spam).
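
For the curious, here is a rough Python sketch of the per-token
weighting as I read it in Graham's "A Plan for Spam".  The function and
parameter names are mine, not his, and the specifics (doubled ham
counts, a minimum of 5 occurrences, clamping to 0.01..0.99, 0.4 for
unseen tokens) are taken from my reading of the essay, so treat them as
assumptions rather than a reference implementation.

```python
def token_probability(good_count, bad_count, n_good, n_bad, unseen=0.4):
    """Spam probability for one token, in the style of 'A Plan for Spam'.

    good_count / bad_count: occurrences of the token in ham / spam corpora.
    n_good / n_bad: number of ham / spam messages (assumed > 0).
    unseen: value for tokens with too little evidence (assumed default).
    """
    g = 2 * good_count          # ham counts doubled to bias against false positives
    b = bad_count
    if g + b < 5:               # too rare to say anything about
        return unseen
    ham_rate = min(1.0, g / n_good)
    spam_rate = min(1.0, b / n_bad)
    p = spam_rate / (ham_rate + spam_rate)
    return max(0.01, min(0.99, p))   # never fully certain either way
```

A dictionary word that shows up about equally in both corpora lands
near 0.5, which is why random chaff contributes so little.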

Note too that, at least for Graham's Bayesian algorithm, the computation
of spamminess is based on the most "interesting" 15 tokens.  So adding a
bunch of neutral chaff to a message doesn't mask the fact that it
contains a large number of spammish keywords.  _Random_ padding won't be
effective.  _Targeted_ padding will be, though spammers would have to
target the non-spam keyword list of individual recipients to be highly
effective (guessing wrong simply adds to the spamminess of an
individual's keyword list).

    A Plan for Spam
    August, 2002
    http://www.paulgraham.com/spam.html
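
To make the "15 most interesting tokens" point concrete, here is a
minimal sketch of the combining step, again based on my reading of the
essay.  spam_score and the sample lists are hypothetical names and
made-up numbers for illustration only; the input probabilities are
assumed to be already clamped to 0.01..0.99.

```python
from math import prod

def spam_score(token_probs, n_interesting=15):
    """Combine per-token probabilities using only the most 'interesting'
    ones, i.e. those farthest from the neutral value 0.5."""
    top = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    top = top[:n_interesting]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

spammy = [0.95] * 15      # made-up spammish keywords
chaff = [0.5] * 200       # random neutral padding
unseen = [0.4] * 200      # words the filter has never seen
targeted = [0.01] * 15    # words that are strongly hammy for this recipient

base = spam_score(spammy)
with_chaff = spam_score(spammy + chaff)
with_unseen = spam_score(spammy + unseen)
with_targeted = spam_score(spammy + targeted)
```

In this sketch the chaff and unseen padding never make it into the top
15, so with_chaff and with_unseen equal base.  Only the targeted
padding displaces the spammy tokens and pulls the score down, and that
requires knowing which words are hammy for that particular recipient.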

I've seen a few chaffed messages slip past my filters in recent weeks.
I dump these into a 'spam-learn' folder, which a cronjob feeds to
sa-learn every 30 minutes.  After a few days of this, the chaffed
messages stop appearing in my "greylist" box (previously unknown
senders).

I also maintain a whitelist, which is the only way a given sender can
end up in my inbox.  Mailing lists collect some spam, but not much.

Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   At the sound of the toner, boycott Lexmark:  trade restraint via DMCA.
    http://news.com.com/2100-1023-979791.html
