Spam, ASNs, CIDRs, and d-u (was Re: spam from chinanet)

To: debian-devel@lists.debian.org
Subject: Spam, ASNs, CIDRs, and d-u (was Re: spam from chinanet)
From: "Karsten M. Self" <kmself@ix.netcom.com>
Date: Tue, 28 Sep 2004 02:17:18 -0700
Message-id: <20040928091717.GB28459@localhost>
Mail-followup-to: debian-devel@lists.debian.org
In-reply-to: <20040915054507.GA16644@cirrus.madduck.net>
References: <20040915054507.GA16644@cirrus.madduck.net>

on Wed, Sep 15, 2004 at 07:45:07AM +0200, martin f krafft (madduck@debian.org) wrote:
> Dear developers,
> 
> Over the past 2 months, about 80 or 90% of the spam I received
> through @d.o came from the networks of Chinanet. I have reported
> every issue, but they never responded, nor are they taking counter
> measures. Some of the spammer's IPs have remained constant. This
> suggests to me that they are spammer-cooperative (or generally
> incompetent).
> 
> May I suggest that we block Chinanet? Their subnets are
> 
> 222.64.0.0/13
> 222.72.0.0/15
> 202.108.181.0/24
> 221.224.0.0/13
> 218.78.0.0/15
> 218.80.0.0/14

See "Incidentally" below for more on these specific netblocks.

> Or you can use rbl.madduck.net, which filters them. I think we could
> potentially cut a lot of spam by blocking these IPs.

In my typical late-to-the-party fashion, a few additional comments on
this topic, some of which I've previously discussed with Martin
off-list.

I've been working with ASN and CIDR data associated with spam received
via my ISP account.  While the specific findings I've got may be
interesting, the methods are of more general use.

Short answer:  you can classify incoming mail by its network of origin,
using nothing more than the sending IP and a DNS query.

Background:  ASN identifies the Autonomous System.  Effectively, these
are the networks the Internet is networking between.  Each is defined by
a single span of routing authorities, peers, etc., and largely,
organizational authority.  In other words:  you've got an identifiable,
accountable entity with a definable network space.  More to the point:
they're _accountable_ for that space, and had damned well better be
keeping it clean.

By getting the ASN associated with an IP and tracking same for spam
received, it's pretty easy to find out where the bulk of spam is coming
from.  My stats _don't_ reflect where ham is originating from, so I'm
getting raw volume, but not ratio, data here.  This could be tacked on
to a better-developed system.

There's a very strong power-law relationship in which ASNs contribute
what proportion of spam.  Over the past nine months:

  - A single ASN has contributed ~15% (12-17%) of all spam I receive.

  - 4-5 ASNs account for a quarter of all spam.

  - 20-30 ASNs account for half of all spam. 

I track results at:

    http://linuxmafia.com/~karsten/monthly-asn-report-current.txt

...as well as history.  See my homepage for details.

Working with ~20 days' spam, I get the following breakout for the top 20
ASNs (the report linked above provides additional details such as name
of the network).  This is based on a total of 7093 spams, and includes
817 ASNs.  There are ~24k assigned ASNs total.

     1     1099 ASN-4766
     2      347 ASN-4134
     3      263 ASN-9105
     4      256 ASN-9277
     5      134 ASN-4814
     6      122 ASN-4837
     7      114 ASN-3352
     8      111 ASN-12076
     9       97 ASN-18747
    10       93 ASN-11908
    11       82 ASN-7132
    12       81 ASN-9924
    13       80 ASN-7418
    14       78 ASN-6939
    15       78 ASN-3269
    16       77 ASN-3786
    17       75 ASN-8346
    18       69 ASN-;;  (apparently a failed lookup)
    19       68 ASN-4713
    20       54 ASN-3462

    These include:  KORNET, China Telecom, tiscali-uk, thrunet, chna169,
    CNCGROUP (China), China Network Communications, TDE (Spain), MSN,
    IFX, Verestar, SBC, Taiwan Fixed Network, Hurricane Electric,
    Telecom Italia, DACOM (Korea), Sonatel (Senegal), NTT-OCNET,
    Chunghwa Telecom (Taiwan).
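
As a rough cross-check of how concentrated this is, the counts in the
table can be summed directly.  This is only a sketch using the figures
quoted above (the per-ASN counts and the 7093 total); the function name
is illustrative, not part of any report script.

```python
# Counts copied from the top-20 ASN table; 7093 is the stated total.
TOP20_COUNTS = [1099, 347, 263, 256, 134, 122, 114, 111, 97, 93,
                82, 81, 80, 78, 78, 77, 75, 69, 68, 54]
TOTAL_SPAM = 7093


def cumulative_share(counts, total, top_n):
    """Fraction of `total` contributed by the first `top_n` entries."""
    return sum(counts[:top_n]) / total


for n in (1, 4, 20):
    share = cumulative_share(TOP20_COUNTS, TOTAL_SPAM, n)
    print(f"top {n:2d}: {share:.1%}")
```

For this 20-day sample that works out to roughly 15.5% for the top ASN,
27.7% for the top four, and 47.6% for the top twenty, in line with the
nine-month figures given earlier.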

For CIDR my data show the top 20 being the following.  

     1      388 222.96.0.0/12
     2      259 212.74.96.0/19
     3      256 221.144.0.0/12
     4      200 61.72.0.0/13
     5       93 64.4.0.0/18
     6       93 220.120.0.0/13
     7       90 61.254.0.0/15
     8       90 195.166.237.0/24
     9       81 200.73.64.0/19
    10       70 213.154.64.0/19
    11       67 connection/timed  (apparently failed lookups)
    12       63 61.31.128.0/19
    13       61 64.71.128.0/18
    14       51 165.165.0.0/16
    15       44 213.215.128.0/18
    16       43 192.118.68.0/22
    17       42 211.36.160.0/19
    18       41 211.110.0.0/16
    19       37 212.216.128.0/17
    20       35 80.88.128.0/20

    These include:  KORnet, Tiscali, KORnet again several times,
    Hotmail, etc. ('whois' on the IP will give you this).

All well and good.  How's it work?

Simple:

    host -t txt <reversed ip>.asn.routeviews.org

...returns the ASN and CIDR for a given IP in parseable format as a DNS
query.

E.g.:

    $ host murphy.debian.org
    murphy.debian.org has address 146.82.138.6
    $ host -t txt 6.138.82.146.asn.routeviews.org
    6.138.82.146.asn.routeviews.org text "27354" "146.82.136.0" "21"

So, that's AS27354, with CIDR 146.82.136.0/21.
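
If you'd rather script this than eyeball it, the two mechanical steps
(building the reversed query name, and pulling ASN and CIDR out of the
reply) might look something like the following in Python.  This is a
minimal sketch: the function names are made up for illustration, and it
assumes the reply always carries three quoted fields in the order shown
in the example (ASN, network, prefix length).  It does no DNS lookup
itself; feed it the output of 'host'.

```python
import ipaddress
import re


def routeviews_qname(ip):
    """Build the asn.routeviews.org query name for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError if malformed
    octets = str(addr).split(".")
    return ".".join(reversed(octets)) + ".asn.routeviews.org"


def parse_routeviews_txt(line):
    """Extract (asn, cidr) from one line of `host -t txt` output.

    Assumes three quoted fields: ASN, network address, prefix length.
    """
    fields = re.findall(r'"([^"]*)"', line)
    if len(fields) != 3:
        raise ValueError(f"unexpected TXT record: {line!r}")
    asn, network, prefix_len = fields
    return int(asn), f"{network}/{prefix_len}"


reply = '6.138.82.146.asn.routeviews.org text "27354" "146.82.136.0" "21"'
print(routeviews_qname("146.82.138.6"))
print(parse_routeviews_txt(reply))
```

Anything that doesn't parse (a timeout, say) raises ValueError rather
than being silently counted as a network.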

A subsequent 'whois AS27354' will tell you that this is LayerOne
Holdings, Inc.

For more general information:

    http://www.routeviews.org/

The data are compiled directly from BGP router maps.  My understanding
is that the zonefiles are downloadable (I'm checking on this now).
They're certainly cacheable.

More to the point:  the data are available at SMTP time.  The one bit of
data you've got is your SMTP peer's IP.  It really doesn't matter if
this is the point of origin of the spam or just an upstream relay.  If
you know you're getting bad traffic from this network (ASN or CIDR), you
can take appropriate action[1].

It's also possible, as I've done, to look at volumes of spam by ASN or
CIDR.  Better, as I indicated, would be *ratios*.  A peer with a very
high ham (non-spam) ratio, which has a spam volume that on an absolute
scale is high, but proportionate to total traffic is middlin', might be
allowed through.  Incidentally, the ratio data should fall out of your
Bayes classifier token database if you know how to parse it.
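
To make the ratio idea concrete, here is a small sketch of ranking
networks by the share of their mail that is spam, rather than by raw
volume.  The network labels and counts are invented purely for
illustration, and the 'min_messages' cutoff is an arbitrary assumption,
not a tuned value.

```python
def spam_share(spam, ham):
    """Fraction of a network's observed mail that was spam (None if no mail)."""
    total = spam + ham
    return spam / total if total else None


def rank_by_share(counts, min_messages=20):
    """Sort networks by spam share, worst first.

    `counts` maps a network label (ASN or CIDR) to a (spam, ham) tuple.
    Networks with fewer than `min_messages` total are skipped as too
    thinly sampled to judge.
    """
    judged = [(net, spam_share(s, h))
              for net, (s, h) in counts.items()
              if s + h >= min_messages]
    return sorted(judged, key=lambda item: item[1], reverse=True)


# Hypothetical counts: AS-B sends more spam in absolute terms than AS-A,
# but far less as a proportion of its traffic.
example = {"AS-A": (90, 10), "AS-B": (200, 1800), "AS-C": (3, 0)}
for net, share in rank_by_share(example):
    print(f"{net}: {share:.0%} spam")
```

Here the high-volume but mostly-ham network ranks below the small,
almost-all-spam one, which is the distinction raw volume can't show.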

Because the data can be encoded into firewall rules, it's possible to
reduce mail filtering load by offloading this to your iptables rules.
Any mail (or optionally:  all) packets from highly hostile networks can
be blocked.  Or rate limiting can be applied.  I'm particularly fond of
the idea of rejecting packets from a network in proportion to its
spam:ham ratio....
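
One way to turn a list of CIDRs into firewall rules is to generate the
iptables commands from a script.  The sketch below only covers outright
blocking of inbound SMTP (TCP port 25); rate limiting or proportional
rejection would need more than this.  The function name and the choice
of the INPUT chain are assumptions for illustration, so review the
output before feeding it to a live firewall.

```python
import ipaddress


def smtp_drop_rules(cidrs, chain="INPUT"):
    """Return iptables commands dropping inbound SMTP from each CIDR."""
    rules = []
    for cidr in cidrs:
        # ip_network() raises ValueError on a malformed block or one
        # with host bits set, so typos don't become firewall rules.
        net = ipaddress.ip_network(cidr)
        rules.append(
            f"iptables -A {chain} -s {net} -p tcp --dport 25 -j DROP"
        )
    return rules


# The two highest-volume CIDRs from the table above.
for rule in smtp_drop_rules(["222.96.0.0/12", "212.74.96.0/19"]):
    print(rule)
```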

If you're not comfortable blocking by ASN, CIDR data give a slightly
finer level of control.  Even for a particularly standout bad net such
as KORNET, there are CIDRs which are markedly worse than others, from a
total volume perspective.

The other nice thing about this is you can base filtering on your _own_,
_current_ experience, and that relatively small sampling systems
generate useful statistics.  Say, based on spam volumes for the current
and prior fortnight or month.  Tracking historical data too far back
will result in previously clean nets being able to slide for a while.
Keeping only relatively current data avoids this problem (and will
probably be the subject of tuning arguments for years to come).
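
Keeping only current data amounts to counting within a sliding window.
A minimal sketch, assuming you log a (timestamp, network) pair for each
spam received; the 30-day default is just one of the window sizes
mentioned above, not a recommendation.

```python
from collections import Counter
from datetime import datetime, timedelta


def recent_spam_counts(events, now, window_days=30):
    """Count spam per network, ignoring anything older than the window.

    `events` is an iterable of (timestamp, network) pairs.
    """
    cutoff = now - timedelta(days=window_days)
    return Counter(net for ts, net in events if ts >= cutoff)


# Hypothetical log entries for illustration.
log = [
    (datetime(2004, 9, 20), "AS4766"),
    (datetime(2004, 9, 1), "AS4766"),
    (datetime(2004, 6, 1), "AS4134"),  # falls outside a 30-day window
]
print(recent_spam_counts(log, now=datetime(2004, 9, 28)))
```

A network that cleaned up its act three months ago simply drops out of
the counts once its old entries age past the cutoff.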

My experience is that the inhabitants of the top five or so spots tend
to remain in place for at least a few months at a time, particularly the
leader (KORnet in my experience), though over the course of nine months
or so I've seen considerable shifts in and out of the 2-5 positions.

Another point is that for many stable email communities, the set of ASNs
and/or CIDRs which correspond frequently is relatively small.  For a set
of 850 recent emails to this list, there are 234 distinct IPs, and 148
ASNs.  Half of the volume was accounted for by 18 ASNs.

Of the top-20 spamming ASNs the following appear in the d-u posts
analyzed.  "Freq" is frequency of occurrence in the d-u sample.  "Spam %"
and "Spam Rank" are the percent contribution of these ASNs to my total
spam load, and the ranking in total spam received, of these networks.

    Freq  ASN    Spam %  Spam Rank  Name
    ----  ---    ------  ---------  -------------------------------
       10 3352     1.4%          9  Internet Access Network of TDE
        2 8220     1.0%         16  COLT Telecommunications - www.colt.net
        1 3269     2.5%          5  TELECOM ITALIA

It would be helpful to run an analysis over a larger corpus of list
posts, but from the look of it, in the neighborhood of a quarter of spam
could be eliminated from d-u with a 0.12% false positive rate.  More
selective filtering (say, CIDR rather than ASN) of less egregiously
spammy networks, and rate throttling rather than outright rejection,
might balance mail filtering with allowing legitimate mail through.
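
The arithmetic behind a false positive figure like this can be checked
from the table.  The sketch below is one reading of the numbers, not a
statement of how the 0.12% was actually derived: it assumes the rate is
the share of the 850 sampled posts that came from blocked ASNs, and
that "a quarter of spam" corresponds to blocking the top five spam
ranks.

```python
SAMPLE_POSTS = 850  # d-u posts analyzed, per the text

# From the table above: ASN -> (spam rank, posts in the d-u sample).
LIST_POSTS = {3352: (9, 10), 8220: (16, 2), 3269: (5, 1)}


def false_positive_rate(max_rank):
    """Share of sampled posts lost if ASNs ranked <= max_rank were blocked."""
    blocked = sum(posts for rank, posts in LIST_POSTS.values()
                  if rank <= max_rank)
    return blocked / SAMPLE_POSTS


print(f"block top 5:  {false_positive_rate(5):.2%}")
print(f"block top 20: {false_positive_rate(20):.2%}")
```

On that reading, blocking the top five costs one post in 850 (about
0.12%), while blocking all of the top twenty would cost 13 posts
(about 1.53%).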

Incidentally, of the IP ranges Martin proposes blocking, my own
experience shows:

Rank   Cum %   Pct  Spams  ASN     Description
----  ------   ---- -----  -----   -------------

> 222.64.0.0/13

 181   80.0%   0.1%    15  4812    China Telecom (Group)

> 222.72.0.0/15

Not assigned (possible bogon?)

> 202.108.181.0/24

 492   92.8%   0.0%     3  4808    Chinanet Beijing Site AS

> 221.224.0.0/13

   2   17.5%   4.1%   689  4134    China Telecom

> 218.78.0.0/15

 181   80.0%   0.1%    15  4812    China Telecom (Group)

> 218.80.0.0/14

 181   80.0%   0.1%    15  4812    China Telecom (Group)

...so at least in my experience, only 221.224.0.0/13 is a high
contributor; blocking that range alone might reduce the false positive
rate significantly.

Of course, YMMV.

Peace.

--------------------
Notes:

1.  If you want to use ASN in your procmail scripts, or to create a
    token which SpamAssassin and other Bayesian classifiers will
    automatically use, you can refer to my ASN procmail header creation
    rule here:

    http://linuxmafia.com/~karsten/Download/procmail-asn-header

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Kerry / Edwards '04                              http://www.johnkerry.com/
