
Spam filters



Hi folks,

	Since I sent mail here about spam filtering, Scott Blachowicz
 <scott@statsci.com> has cleaned up the script that fetches the
 spam listings. A single script now gets information from a number
 of sites (the list can be overridden on the command line), optionally
 splits the output into multiple files, and formats the output either
 for sendmail or mailagent (the default being [what else?] mailagent).

	I hope this is useful. 

	manoj
-- 
 Sir, the cow she walks. She talks. She's full of chalk.  The lactose
 secretions of the female of the bovine species are highly desirable
 to the n'th degree. A West Point Cadet's answer to, "How's the Cow?",
 which roughly translates to, "How many servings of milk are left upon
 the table?".  (The "n'th" indicates the number of servings).
Manoj Srivastava               <url:mailto:srivasta@acm.org>
Mobile, Alabama USA            <url:http://www.datasync.com/%7Esrivasta/>

#! /usr/bin/perl
use Getopt::Std;  # replaces the Perl 4 library getopts.pl

#  Scott Blachowicz <scott@statsci.com>
# -l LISTNAME (colon sep list of indices into %urls hash)
# -o OUTPUT  (base prefix for split lists, filename for merged list)
# -s (split lists into individual files - default is on for 'mailagent' &
#     off for others)
# -S (turn off splitting)
# -t TYPE_OF_OUTPUT ("sendmail" or "mailagent" - default "mailagent")
# -v (verbose)
getopts ('l:o:sSt:v');

my $lists = defined $opt_l ? $opt_l : 'ALL';
my $spam_base = defined $opt_o ? $opt_o : "$ENV{'HOME'}/etc/.spamlist";
my $output_type = defined $opt_t ? $opt_t : "mailagent";
my $split_lists = defined $opt_s || ($output_type eq "mailagent");
$split_lists = 0 if defined $opt_S;
my $verbose = defined $opt_v;

use strict;

use Sys::Hostname;
my $host = hostname();
if ($host !~ /\./) {
  # Try to add a domain name?
  my ($name, $aliases, $addrtype, $length, @addrs) = gethostbyname($host);
  my @aliases = grep(/\./,split(/\s+/,$aliases));
  $host = $aliases[0] if @aliases;
}

use URI::Escape;
my $ftpuser = "ftp:despammer%40" . uri_escape ("$host");

use LWP::Simple;

my %urls = 
  ('aol', "http://www.idot.aol.com/preferredmail/",
   'mindspring', "http://www.atl.mindspring.com/cgi-bin/spamlist.pl",
   'znet', "http://www.znet.com/spammers.txt",
   ## too many bad matches: 'wsrcc', "http://www.wsrcc.com/spam/spamlist.txt",
   'iocom', "http://www.io.com/help/killspam.php",
   'nancynet', "ftp://ftp.cybernothing.org/pub/abuse/nancynet.domains",
   'cyberpromo', "ftp://ftp.cybernothing.org/pub/abuse/cyberpromo.domains",
   'llv', "ftp://ftp.cybernothing.org/pub/abuse/llv.domains",
  );

my %parsers = 
  ('aol', '&parse_aol($_)',
   'mindspring', '&parse_mindspring($_)',
   'iocom', '&parse_iocom($_)',
  );
my %unspam = 
  (
   'concentric.net', 'non-spam emails', #wsrcc
   'demon.net', 'non-spam emails',    #wsrcc
   'hotmail.com', 'free email used by non-spammers as well',
   'interactive.net', 'non-spam emails',    #znet
   'mindspring.com', 'non-spam emails',    #wsrcc
   'psi.net', 'non-spam emails',    #wsrcc
   'shoppingplanet.com', 'non-spam emails',
   'vnet.net', 'non-spam emails',   #wsrcc
   'yoyo.com', 'non-spam emails',
  );

if (! $split_lists) {
  open OUT, ">${spam_base}" or die  "create ${spam_base}: $!";
}

my $site;
foreach $site (keys %urls) {
  print "# Processing '$site' at URL $urls{$site}\n" if $verbose;
  
  if ($_ = get $urls{$site}) {
    if ($split_lists) {
      open OUT, ">${spam_base}-$site" or
	die  "create ${spam_base}-$site: $!";
    }
    
    ## 1) Filter out duplicate sites if going to one spamlist file.
    ## 2) Filter out '#'-started comments.
    ## 3) Filter out blank lines.
    ## 4) be sure $1 is what you want in the annotation
    ## 5) if no @ char, stick a "any user/any subdomain/host" regexp in.
    print OUT map {s/\#.*$//;
		   /\S/ && eval "\&filter_${output_type}(\$_)";
		 } grep((!$unspam{$_} &&
			 ($split_lists || !$unspam{$_}++)),
			($parsers{$site} ?
			 eval "$parsers{$site}" : split /\n/));
    close OUT if $split_lists;
  }
  else {
    warn "Cannot get $urls{$site}\n";
  }
}
close OUT if !$split_lists;

sub filter_mailagent {
    local($_) = @_;
    "/^(" . (/\@/ ? "" : "(.*[\@.])?") . "\Q$_\E)\$/i\n";
}

sub filter_sendmail {
    local($_) = @_;
    "$_ " . (/\@/ ? "SPAMMER" : "JUNK") . "\n";
}

sub parse_aol {
    local($_) = @_;
    if (! s/^[\s\S]*<MULTICOL.*\n//) {
        warn "parse_aol: missing MULTICOL in $_ ";
        return ();
    }
    if (! s/<\/MULTICOL[\s\S]*//) {
        warn "parse_aol: missing /MULTICOL in $_ ";
        return ();
    }
    split /\n/;
}

sub parse_mindspring {
    local($_) = @_;
    if (! s,^[\s\S]*?<pre>[^\n]*\n,,) {
        warn "parse_mindspring: can't find block of hostnames";
        return ();
    }
    if (! s,</pre>[\s\S]*?<pre>[^\n]*\n,,) {
        warn "parse_mindspring: can't find block of email addresses";
        return ();
    }
    s,</pre>[\s\S]*$,,;
    split /\n/;
}

sub parse_iocom {
    local($_) = @_;
    if (! s,^[\s\S]*?<H(\d)>Blocked\s*Domains</H\1>[\s\S]*?<TABLE[^\n]*\n,,) {
        warn "parse_iocom: can't find 'Blocked Domains' table";
        return ();
    }
    # [\s\S]* also crosses newlines, so everything after the table is dropped
    if (! s,</TABLE[\s\S]*,,) {
        warn "parse_iocom: can't find end of 'Blocked Domains' table";
        return ();
    }
    s,<[^>]+>,,g;
    split /[\s\n]+/;
}

## This version written by:
##  Scott Blachowicz <scott@statsci.com>
### Originally from...
### 
### which creates a series of lines in "~/.spamlist" that look like:
### 
###     /^((.*[@.])?1floodgate\.com)$/i
###     /^((.*[@.])?205\.254\.167\.57)$/i
### 
### which happen to be directly useful in .rules lines that look like this:
### 
###     ## flag spam (thank you, AOL!)
###     <TO_MERLYN> Envelope From Sender Relayed Reply-To: "~/.spamlist" {
### 	    ANNOTATE -d X-merlyn-spam Smells like spam from %1;
### 	    ## eventually, file in list.spam or delete,
### 	    ## but for now, just testing...
### 	    REJECT;
###     };
### 
### This is still a work in progress, but I thought I'd publish this alpha
### release in case anyone else wanted to hack along with me.
### 
### -- 
### Name: Randal L. Schwartz / Stonehenge Consulting Services (503)777-0095
### Keywords: Perl training, UNIX[tm] consulting, video production, skiing, flying
### Email: <merlyn@stonehenge.com> Snail: (Call) PGP-Key: (finger merlyn@ora.com)
### Web: <A HREF="http://www.stonehenge.com/merlyn/">My Home Page!</A>
### Quote: "I'm telling you, if I could have five lines in my .sig, I would!" -- me
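
For anyone wondering what the two output types look like, here is a
small standalone sketch of the formatting step. It is not part of
Scott's script: the two subs follow filter_mailagent and
filter_sendmail above, but are rewritten with "my" variables and
quotemeta() so they run under strict and warnings. The domain is the
one from Randal's comment; the address spammer@example.com is a
made-up sample entry.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# mailagent form: one anchored, case-insensitive regexp per entry.
# Entries without an '@' get an optional "any user / any subdomain" prefix.
sub filter_mailagent {
    my ($entry) = @_;
    my $prefix = $entry =~ /\@/ ? "" : "(.*[\@.])?";
    return "/^(" . $prefix . quotemeta($entry) . ")\$/i\n";
}

# sendmail form: the entry followed by a tag, SPAMMER for full
# addresses and JUNK for bare domains.
sub filter_sendmail {
    my ($entry) = @_;
    return "$entry " . ($entry =~ /\@/ ? "SPAMMER" : "JUNK") . "\n";
}

# Sample entries (the address is hypothetical).
my @entries = ('1floodgate.com', 'spammer@example.com');

print filter_mailagent($_) for @entries;
# /^((.*[@.])?1floodgate\.com)$/i
# /^(spammer\@example\.com)$/i

print filter_sendmail($_) for @entries;
# 1floodgate.com JUNK
# spammer@example.com SPAMMER
```

The first mailagent line is the same shape as the ~/.spamlist lines
quoted in Randal's comment, so it should drop straight into a rule
like the one shown there.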



