Re: The New World Order is Here !



* Oleksandr Moskalenko <malex@purdue.edu> spake thus:
> 
> Stig,
> 
> Would you mind sharing your set of procmail filters?

I don't mind at all. In fact, that has been the goal all along, hence
it is vigorously commented and implemented with "easy tuning" variables
at the top for people who find procmail's syntax a bit terse. I just
wanted to test it a bit more first.

My understanding of attachments and/or MIME is a bit poor at the moment,
so the code for that part may be a bit embarrassing. I eventually want to
be able to just drop the non-text parts of a multipart message "on the
floor" and leave a note at the top saying "[was multipart type X, crap
dropped]", similar to the html filter.

A warning though: this was implemented without reading any RFCs. I have
just tweaked it until it filed everything *I* felt was spam in my
spam folder with as few false positives as possible. I have a few
other ideas, not implemented yet, for making it even more successful at
that, but I am becoming very pleased with how this works now.

Let's call it GPL'ed software, shall we? I am not an expert on
licensing, and don't intend to become one, but I might have to read the
GPL one day... It goes without saying that no guarantee is provided with
this software; don't blame me if you lose mail or your system explodes.

The only other restriction I set is that if you find any errors or have
any suggestions as to how I can improve this software, please let me
know.

Stig

-- 
brautaset.org
Registered Linux User 107343

# my .procmailrc file, from beginning to end.
PATH=$HOME/bin:/usr/local/bin:/usr/bin:/bin
MAILDIR=$HOME/Mail
DEFAULT=in.mbox
LOGFILE=log
VERBOSE=on

# This is the limit at which we consider the message spam.
limit=28

# This is the version number 
version=0.0.5

# Files section.

# This is the folder that spam will go to.
spam_folder=spam

# The Message-Id: headers of messages that are considered spam are stored
# here, and used to automagically black-hole messages that reference them.
# This is useful on mailing lists where follow-up to spam is a problem.
spam_id=spam_id

# This file is where spammer addresses (possibly partial addresses) are
# stored. Messages to/from any of these are most likely spam, so junk them.
# TODO: make the score for this option configurable as well.
blacklist=blacklist

# This file is the opposite of the one above; mail from these addresses is
# never filed as spam.
whitelist=whitelist

# Sometimes, esp. when you post to various mailing lists or use fetchmail to
# get your mail via a slow connection, you end up getting the same mail
# twice. Set this to 'yes' if you want to receive duplicated messages twice
# (or more!).
recive_duplicated=no

# This file collects the duplicated messages when the above is set to 'no'.
duplicates=duplicates


# These variables are here to make the filter a bit easier to tweak; they are
# normal procmail-style regular expressions (CAVEAT: shellmetas must be
# escaped twice).

# A message that does not match any of my addresses in the header is more
# than likely spam.
my_addresses="((stigbrau@start|s-braut@online)\.no|stig@brautaset\.org)"

# If the message is not to me directly.
cc_to_me=5

# If a message is not addressed to me at all (matches mailing lists such as
# debian-user, but I found out that spam on these lists is much more
# frequent and as such it does make sense to give them a head start in the
# point-gathering process).
not_to_me=10

# Messages from these domains are often spam.
crap_domains="(aol|msn|hotmail|earthlink)"
crap_domain_score=9

# These words in the body of the mail are positive.
positive_words="(Stig|Brautaset|:-\))"
positive_words_score=3

# Character sets other than these are not appreciated.
ok_charsets="(latin|iso-8859|Windows-1252|us-ascii|utf-8)"
bad_charset_score=13

# These words in the body count against it being a legitimate mail. The
# score is per word, but long mails that mention the words sparsely
# are punished less severely than short mails.
negative_words="(money|lo+se\ weight|click\ (here|to)|save|privacy|free\ |remove\ yourself|test|please\ ignore|ignore\ this|unsubscribe|e-?mail|microsoft|windows)"
negative_words_score=4

# I know where to get my porn, please don't bother me with yours. 
porn="(sex|porn|tits|penis|cock|erection|hardon|horny|slut\ |uncensored|privacy|celeb|hotties|viagra|whore|toonz|adult)"
porn_score=8

# I really detest excessive punctuation. This is for the subject line.
crap_chars="(!|\\?|\\*|\\$|\||-|\"|\')"

# Same as above, but many people use '-' and '*' in signatures, so this
# regular expression is for the body.
crap_chars_body="(!|\\?|\\||\\$)"

# This is the regular expression that we should skip when looking at subject
# lines.
re_skip_string="(re:|fwd:|fw:|sv:|\[gllug\])"

# I absolutely hate html mail.
strip_html=yes
html_content_score=13

# If the subject line ends in "[aoe09]" or something similar, it better not
# breach anything else; this is a *very* strong indication of spam.
spam_style_uniq_subject=`echo $(($limit - 1));`

# I don't like lots of quoted material. The following means that the post
# has to have at least twice as many non-quoted lines as quoted lines not to
# get the score. Set it to any value you wish. The score is the one that will
# be added to the message, regardless of severity. This might change.
quote_ratio=2
quote_score=5

# More severity configuration options.
no_subject=17
all_caps_subject=13
not_plain_text=7


# The actual rule set begins here. Little configuration should be needed from
# this point on. 

# If you want a different header to the "X-SPAM:" thingie that is default, it
# can be changed here.
xspam="X-SPAM:"

# Do not just remove this line. Because we add a newline before any of the
# extra X-SPAM: lines below, this variable needs to be set.
message="$xspam This is SpamTracker version $version by Stig Brautaset."
score=0

# Keep a backup copy of the latest 200 messages.
:0 c
backup
:0 ic
| cd backup && rm -f dummy `ls -t msg.* | sed -e 1,200d`

# We don't want to get duplicate mails.
:0 Whc: msgid.lock
* recive_duplicated ?? no
| formail -D 16384 msgid.cache
:0 a:
$duplicates

						 
# This recipe is for adding addresses to the white/blacklists. 
:0
* $ ^To.*$my_addresses
* $ ^From.*$my_addresses
{
	# Add addresses from the subject line to the list of known good
	# senders.
	:0 i:
	* ^Subject:.*white[      ]*\/.+
	| echo $MATCH >> $whitelist
	
	# Add addresses from the subject line to the list of known spammers.
	:0 i:
	* ^Subject:.*black[      ]*\/.+
	| echo $MATCH >> $blacklist

	# Just plonk the thread, not the sender.
	:0 w:
	* ^Subject:.*thread
	| formail +1 -ds formail -x"Message-Id:" >> $spam_id

	# If I forward spam to myself, the sender(s) of the original mail
	# should be blacklisted, and the thread will be classified as spam.
	:0 
	* ^Subject: Fwd: 
	{
		# It should not be necessary to take a copy of the message,
		# but I have not found out how to deal with this otherwise.
		:0 c:
		| formail +1 -ds formail -cx"Message-Id:" >> $spam_id 

		:0 c:
		| formail +1 -ds formail -cx"From:" | sed -e 's/.*\( \|<\)\([[:alnum:]@._-]\+.[A-Za-z]\+\).*/\2/' >> $blacklist

		:0 :
		| formail +1 -ds >> $spam_folder
	}
}


# First we take out mail that is an answer to a mail previously black holed,
# or mail from people on our shitlist. This saves a bit of processing, and is
# very useful on mailing lists where much of the spam is people commenting on
# or replying to spam.
:0
* 1^0 ? formail -cx"References:" -x"In-Reply-To:" | fgrep -is -f $spam_id
* 1^0 ? formail -cx"From" -cx"From:" -cx"Sender:" -cx"X-Envelope-Sender:" | fgrep -is -f $blacklist
{
	:0 i:
	* ^Message-Id:\/.+
	| echo $MATCH >> $spam_id

	:0 hf
	| formail -A"$xspam Follow-up to previously caught spam, or from a blacklisted address."

	:0 :
	$spam_folder
}


# Filter out HTML, and leave a message that it is filtered. I am forever
# grateful to Bart Schaefer for this one. 
:0
* strip_html ?? yes
* ^Content-Type: text/html
* $ $html_content_score^0 
{
	message=`echo -e "$message\n\
$xspam HTML Content.					SpamScore: $="`
	score="$score + $="

	:0 bfW
	| (echo "[html stripped]"; lynx -dump -force_html -stdin)
	
	:0 ahfw
	| formail -i"Content-Type: text/plain" 
}

:0 
* ^Cc:\/.*
* $ MATCH ?? $my_addresses
* $ $cc_to_me^0 
{
	message=`echo -e "$message\n\
$xspam Message not directly to me.			SpamScore: $="`
	score="$score + $="
}
TMP=$MATCH

:0 E
* ^To:\/.*
* $ ! MATCH ?? $my_addresses
* $ $not_to_me^0 
{
	message=`echo -e "$message \n\
$xspam Message not addressed to me.			SpamScore: $="`
	score="$score + $="
}

:0 i
* ^To:\/.*
TMP=| echo "$TMP $MATCH"

# Messages with many recipients are penalized.
:0
* -5^0 
* 5^1 TMP ?? @
{
	message=`echo -e "$message \n\
$xspam To:/Cc: contain several addresses.		SpamScore: $="`
	score="$score + $="
}

:0 
* ^Content-Type:\/.*
* ! MATCH ?? (text/plain|multipart/signed)
* $ $not_plain_text^0  
{
	message=`echo -e "$message \n\
$xspam Format of message is not plain text.		SpamScore: $="`
	score="$score + $="
}


# If the message has no subject, or it consists entirely of spaces/tabs, it's
# likely spam, and SpamScore is written to the header. Otherwise extract the
# subject but dump any occurrence of Re:/fwd:/SV: in the beginning of the line
# and send it to the second part of this recipe which will check whether the
# extracted part contains any lower-case characters. If not, SpamScore is
# written to the header. Thanks to David W. Tamkin for this technique.
:0 
* $ ! ^Subject:[	 ]*($re_skip_string[	 ]*)+\/.+
* ! ^Subject: \/.+
* $ $no_subject^0 
{
	message=`echo -e "$message \n\
$xspam No subject.					SpamScore: $="`
	score="$score + $="
}

:0 ED
* ! MATCH ?? [a-z]
* $ $all_caps_subject^0 
{
	message=`echo -e "$message \n\
$xspam No lower case characters in subject.		SpamScore: $="`
	score="$score + $="
}

# Give bad marks for more than one consecutive $crap_chars in Subject (still
# in "MATCH" from last recipe) and for multiple forwards or includes.
:0 
* $ 3^0.7 MATCH ?? $crap_chars[	 ]*$crap_chars
* $ 4^0.6 MATCH ?? $re_skip_string[	 ]*$re_skip_string
{
	message=`echo -e "$message \n\
$xspam Crap characters in subject. 			SpamScore: $="`
	score="$score + $="
}

# Spammers nowadays seem to like to make their subject lines unique by
# appending a number/string in square brackets at the end of their subject
# fields. Oh well, a strong indicator of spam, and easy to spot to boot ;)
:0
* $ $spam_style_uniq_subject^0 MATCH ?? \[[a-z0-9]+\]$
{
	message=`echo -e "$message \n\
$xspam Subject has spam-style unique header.	 	SpamScore: $="`
	score="$score + $="
}

:0 
* $ $crap_domain_score^0 $crap_domains
{
	message=`echo -e "$message \n\
$xspam Header has crap domain in it.			SpamScore: $="`
	score="$score + $="
}

:0 HB
* charset=()\/.+
* $ ! MATCH ?? $ok_charsets
* $ $bad_charset_score^0 
{
	message=`echo -e "$message \n\
$xspam Reference to deprecated charset.		SpamScore: $="`
	score="$score + $="
}

# Do some checks in the body as well, but only if it is in an understandable
# format. 
:0 B
* ! ^Content-Disposition: attachment
* 2^1 $ $crap_chars_body[	 ]*$crap_chars_body[	 ]*$crap_chars_body
{
	message=`echo -e "$message \n\
$xspam Adjacent crap characters in body.		SpamScore: $="`
	score="$score + $="
}

:0 B
* $ $quote_ratio^1 ^>
* -1^1 ^[^>]
{
	message=`echo -e "$message \n\
$xspam Quoted lines exceed 1:$quote_ratio limit.			SpamScore: $quote_score"`
	score="$score + $quote_score"
}

# Give bad marks for mentioning negative words.
:0 
* $ $negative_words_score^2 $negative_words
* $ $negative_words_score^1 B ?? $negative_words
* 5^0
* -5^1 > 1000
{
	message=`echo -e "$message \n\
$xspam Hits for negative words.			SpamScore: $="`
	score="$score + $="
}

# I know where to get porn. Don't bother me with your crap.
:0 
* $ $porn_score^2 $porn
* $ $porn_score^1 B ?? $porn
* 5^0
* -5^1 > 1000
{
	message=`echo -e "$message \n\
$xspam Hits for porn.					SpamScore: $="`
	score="$score + $="
}

:0 B
* $ $positive_words_score^1 $positive_words
* -5^1 > 1000
{
	message=`echo -e "$message \n\
$xspam BODY: Positive words hit.			Score:    -$="`
	score="$score - $="
}

score=`echo $(( $score ));`
# Check whether we went over the limit, and put a report in if we did.
:0 
* $ $score^0
* $ -$limit^0
{
	message=`echo -e "$message \n\
$xspam Total SpamScore ($score) exceeded limit ($limit) by $= points."`

	:0 fw
	| formail -A"$message"

	# If the message is spam, then record its message id so we can plonk
	# follow-ups to it.
	:0 i:
	* ^Message-Id:\/.+
	DUMMY=| echo $MATCH >> $spam_id

	# Finally, deliver the message in the spam folder.
	:0 a:
	* ! ? formail -cx"From:" | fgrep -is -f $whitelist
	$spam_folder

	:0 Ef
	| formail -A "$xspam Message indicates spam, but sender is on whitelist."
}

:0 fw
| formail -A"$message"
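
The rc file above never adds numbers directly: each hit appends " + N" to
the string in $score, and the string is evaluated once at the end with
shell arithmetic before being compared against $limit. A minimal sketch of
that bookkeeping in plain shell, outside procmail, might look like this.
The add_hit helper and the three sample hits are made up for illustration;
only limit=28 and the string-building trick come from the rc file.

```sh
#!/bin/sh
# Sketch of the score bookkeeping used in the rc file above.
# add_hit and the sample hits are hypothetical illustrations.
limit=28
score=0

add_hit() {
    # Append the hit to the expression instead of summing right away,
    # mirroring score="$score + $=" in the recipes.
    score="$score + $1"
}

add_hit 10    # e.g. not_to_me
add_hit 13    # e.g. html_content_score
add_hit 8     # e.g. porn_score

# Evaluate the accumulated expression once, as the rc file does with
# score=`echo $(( $score ));`
total=$(( $score ))

if [ "$total" -gt "$limit" ]; then
    verdict="spam"
else
    verdict="ham"
fi

echo "expression: $score"
echo "total: $total, limit: $limit, verdict: $verdict"
```

With these sample hits the expression is "0 + 10 + 13 + 8", which
evaluates to 31 and so lands over the limit of 28.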


