
Re: smtp time spam filtering



On Sat 2007-02-24 21:03:11 -0500 Greg Folkert wrote:
> On Sat, 2007-02-24 at 23:24 +0100, Matus UHLAR - fantomas wrote:
> > > > 	On Fri, Feb 23, 2007 at 03:33:00PM +0000, David Hart wrote:
> > > > > I must be missing something here.  In order to scan an email you must
> > > > > receive the email (I don't mean accept).  How can rejecting/accepting
> > > > > emails at this stage make any significant difference in bandwidth used
> > > > > (let alone a quadrupling of bandwidth)?
> > 
> > > On Fri 2007-02-23 08:16:48 -0800 Andrew Sackville-West wrote:
> > > > isn't it just using RBLs at smtp time and rejecting before receiving
> > > > the mail? 
> > 
> > On 23.02.07 19:15, David Hart wrote:
> > > AFAIU no, but that's the way I do it with postfix.  Both my primary
> > > and secondary MXs do RBL checks and stuff like recipient validation
> > > and then make the accept/reject decision after the RCPT TO: but before
> > > the DATA.
> > > 
> > > Greg Folkert said that he uses SA-Exim (which calls spamassassin)
> > > to do scans at smtp time but without any online checks.  I don't see
> > > how you can do this without receiving the bulk of the email.
> > 
> > the advantage of smtp time rejection is, you will just reject the data with
> > error and you don't have to do anything with it - the rest is up to sender.
> > Especially if you would bounce the e-mail, you'll win this way...
> 
> Bouncing... bingo. If the sender doesn't handle it properly, it isn't my
> problem.

You've already outlined a case where bouncing spam became your problem.
You said in an earlier mail "I used to not whitelist murphy, but that
got me auto-unsub'd from (most) Debian lists I subscribe to, for
"bouncing" the SPAM"

I used to filter mail lists through spamassassin but I stopped doing
that a long time ago as, of the lists that I subscribe to, I was getting
almost as many false positives as spam caught.  It just wasn't worth the
bother.

It's not the fact that a little bit of spam slips through that concerns
me (some is inevitable) so much as WHERE it slips into.  If spam gets
into a mailbox that I'm MONITORING for incoming mail then it interrupts
me while I check it out and deal with it.  THAT pisses me off.  A couple
of spams a day on the debian list is no big deal (I don't monitor mail
lists for mail as it comes in).

I understand entirely your desire to reject as much spam as you can
as early as you reasonably can - that's what I do - but there comes
a point, I think, where the costs start outweighing the benefits.

I'll give you a few (approximate) numbers of smtp time rejects on my
MXs from yesterday as an example.

My primary MX:

  11	domain not found
  6	invalid name
  4	you are not tonix.org (pretending to be me)
  26	need fully qualified hostname
  117	rejected by one of four RBLs I check
  150	user unknown in local recipient table

My secondary MX:
  (My primary MX was up all the time yesterday so no legitimate smtp
  servers should be contacting the secondary.)

  1	illegal address syntax
  7	need fully qualified hostname
  73	rejected by RBLs
  11	user unknown in relay recipient table

I filter the spam that spamassassin catches into three folders based
on the score.  Yesterday I caught:

  7	score >= 9
  1	score >= 4 (my spam threshold)
  2	score >= 1 (I check this one often)

Total spam rejected at smtp time - 406
Total spam caught by SA - 10 
Total spam that made it into a mailbox that I monitor in real time
for incoming mail - ZERO

Yay!  100% success on my objective ^_^
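
For anyone who wants to check the arithmetic, here is a minimal Python
sketch that re-adds the counts listed above.  The dictionary keys are
just labels made up for this example, and the percentage at the end is
derived from the same numbers rather than quoted from any log.

```python
# Reject counts copied from the lists above; the key names are only labels.
primary_mx = {
    "domain not found": 11,
    "invalid name": 6,
    "pretending to be tonix.org": 4,
    "need fully qualified hostname": 26,
    "rejected by RBL": 117,
    "user unknown in local recipient table": 150,
}
secondary_mx = {
    "illegal address syntax": 1,
    "need fully qualified hostname": 7,
    "rejected by RBL": 73,
    "user unknown in relay recipient table": 11,
}
# The three spamassassin folders, treated as separate buckets.
spamassassin = {"score >= 9": 7, "score >= 4": 1, "score >= 1": 2}

smtp_rejects = sum(primary_mx.values()) + sum(secondary_mx.values())
sa_caught = sum(spamassassin.values())

# Share of all stopped spam that never got past the smtp-time checks.
smtp_share = smtp_rejects / (smtp_rejects + sa_caught)

print(smtp_rejects, sa_caught, f"{smtp_share:.1%}")
```

On these figures the cheap checks made before DATA account for nearly
all of the spam stopped, and content scanning only has to deal with
the remainder.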

This is a fairly typical day.  Usually I don't reject quite as much at
smtp time and three or four more spams reach spamassassin.  Only a couple
of spams per month reach one of my inboxes.

> I receive up to the 5K of message section. Then SA-Exim pauses the
> connection for a bit... doing its job.
[snip]

I wondered whether you might be doing something like that.

I don't use exim nor, obviously, sa-exim but I did find the following at
http://marc.merlins.org/linux/exim/files/sa-exim.conf.

  # How much of the body we feed to spamassassin (in bytes)
  # Default is 250KB
  SAmaxbody: 256000

  # Do you want to feed SAmaxbody's worth of the message body if it is
  # too big?
  # Either, you skip messages that are too big and not scan them, or you can
  # truncate the body and feed that to SA.
  # Note that SA will sometimes raise the spam score if it can't parse
  # the message correctly (since the end is missing, decoding will fail)
  # Default is 0: do not scan messages that are too big
  # (note that this is parsed as a condition)
  SATruncBodyCond: 0

So it would seem that by truncating the feed to spamassassin you may
be increasing your risk of false positives.
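
To make the two options concrete, here is a rough Python sketch of the
choice those comments describe.  It is NOT sa-exim's code:
body_for_scanning() and its parameters are hypothetical names, and the
real SATruncBodyCond is parsed as a condition rather than a plain
boolean.  The base64 part at the end only illustrates one way a cut-off
body can fail to decode.

```python
import base64
import binascii
from typing import Optional


def body_for_scanning(body: bytes, max_body: int = 256000,
                      truncate: bool = False) -> Optional[bytes]:
    """Return the bytes to hand to the scanner, or None to skip the scan.

    Hypothetical helper modelled on the config comments, not on sa-exim.
    """
    if len(body) <= max_body:
        return body               # small enough: scan the whole thing
    if truncate:
        return body[:max_body]    # scan only the first max_body bytes
    return None                   # too big and not truncating: no scan


# A base64-encoded part cut off in the middle no longer decodes cleanly.
encoded = base64.b64encode(b"hello world")
cut = body_for_scanning(encoded, max_body=5, truncate=True)
try:
    base64.b64decode(cut)
    decoded_ok = True
except binascii.Error:
    decoded_ok = False

print(len(encoded), len(cut), decoded_ok)
```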

And I still don't see how turning off sa-exim makes you use four
times the bandwidth.

Also, the figures you gave for receive and scan delays don't seem
quite right to me.  ~3s receive delay without sa-exim on a box that
can put a message through spamassassin in ~1s seems slow to me.

Here are some more figures from my primary MX but first let me tell
you its spec and what it's running.

PII 350MHz 96MB ram running: postfix (duh!), ntpd, apache, bind9,
nfs-server and openvpn.  Plus, for the last few weeks (since my last
remaining P4 blew up with a loud bang) it's been running X, fluxbox,
firefox etc and been serving as my (hopefully temporary) workstation.
According to 'free' it's currently using about 85MB swap.

Despite this, over the last two days, my box still managed to deliver
582 messages locally with an average delay of 1.17s.  This average
includes both lookup time on up to 4 RBLs and the time taken to process
~17 messages through spamassassin (and on this old crock it takes ~25s
for spamassassin to process a message).  As postfix accepts the mail
before spamassassin finishes its work, the average delay is probably
closer to 1s.

For example, over the same period, I had 297 mails from the debian
list (which do not go through spamassassin but are still checked
against the RBLs) and the average was 0.77s per message.

So I seem to be getting ~4 times or more the speed of your box on a
'pile of old junk' that's 'filled to the gills' with processes.
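
As a rough check on that "~4 times", here is the division in Python.
It assumes the comparison is between the ~3s receive delay mentioned
above and the 0.77s average for list mail that skips spamassassin;
that pairing is an assumption made for this sketch.

```python
# Figures taken from the text above (seconds per message).
quoted_receive_delay = 3.0   # the ~3s receive delay discussed above
list_mail_average = 0.77     # debian list mail, RBL lookups only
all_local_average = 1.17     # all local deliveries over the two days

ratio_vs_list_mail = quoted_receive_delay / list_mail_average
ratio_vs_all_mail = quoted_receive_delay / all_local_average

print(round(ratio_vs_list_mail, 1), round(ratio_vs_all_mail, 1))
```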

Increasing receive delay by a factor of 4 may not sound like much to
you when your box has memory and cycles to burn but it's a different
story if you're running a busy mail server that's working hard.
4 times the delivery latency translates into many times the processes
running which means buying hardware which costs money.  People who are
connecting to you to deliver mail are paying that cost on your behalf.
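
One way to put numbers on the latency argument is Little's law: the
average number of deliveries in progress equals the arrival rate times
the time each one takes.  The sketch below is only a model.  The
arrival rate of 10 messages per second is a made-up figure, and
concurrent_sessions() is a hypothetical helper.

```python
def concurrent_sessions(arrivals_per_second: float,
                        seconds_per_delivery: float) -> float:
    """Little's law: average number of deliveries in progress at once."""
    return arrivals_per_second * seconds_per_delivery


arrival_rate = 10.0             # hypothetical busy server, messages/second
base_latency = 0.77             # seconds per delivery, from the text above
slow_latency = 4 * base_latency # the same delivery taking 4 times as long

base_load = concurrent_sessions(arrival_rate, base_latency)
slow_load = concurrent_sessions(arrival_rate, slow_latency)

print(round(base_load, 1), round(slow_load, 1), round(slow_load / base_load, 1))
```

Under this model the number of simultaneous smtp processes grows in
step with the latency, on both ends of the connection.  It does not
capture a box that starts swapping under the extra load, which could
fare worse than the model suggests.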

If you were to run the full mail through SA (in order to improve its
accuracy) the difference would be even greater.  Perhaps 6 or more
times slower than my pile of old junk?

And, you don't even seem to be able to scan all the mail that you
want through SA as you need to whitelist mailing lists.

If you were teergrubing the bad guys I'd applaud you for that but
you could teergrube the bad guys without smtp time scanning, at about
95% of the rate you manage now, and _without_ hitting the good guys.

I still can't see one advantage with smtp-time scanning.

-- 
David Hart <debian@tonix.org>


