
Tip for seeding jwhois -- caching whois client



I'm making extensive whois queries in generating spam reports (everyone
needs a hobby, right...).  Which...is...slow....

...so I was excited to discover "jwhois", a caching whois client.  This
creates a cache (/var/cache/jwhois/jwhois.db) of previously requested
domains.  In spam lookups this is convenient, as 100 domains account for
well over half my spam (1403 total domains recorded).

Problem is:  jwhois only caches lookups where it already knows the
server.

The trick then, is to seed the cache.  I'd already performed a host
lookup on some 5000+ spams I've received (since early November! -- and
yes, caching DNS helps tons), so the following does the trick:


Assuming domains in /tmp/spamdomains-ranked, in the following format
(modify recipe to suit):

------------------------------------------------------------------------
     1	    345 kornet.net
     2	    156 freeserve.com
     3	    148 comcast.net
     4	    138 rr.com
     5	    132 guangzhou.gd.cn
     6	    107 uu.net
     7	    104 attbi.com
     8	     95 dacom.co.kr
     9	     67 pacbell.net
    10	     64 wanadoo.fr
------------------------------------------------------------------------
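
The post doesn't say how the ranked file was produced.  The layout looks
like `uniq -c` output numbered by `cat -n`, so as a guess (the function
name `rank_domains` is made up for illustration), a list of one domain
per line could be ranked roughly like this:

```shell
# Hypothetical helper: read one domain per line on stdin and print
# "rank count domain" lines, most frequent first.
rank_domains() {
    sort |                  # group identical domains together
        uniq -c |           # prefix each distinct domain with its count
        sort -k1,1nr -k2 |  # highest count first, ties by domain name
        cat -n              # prepend the rank
}
```

Whatever produced the list, the domain ends up in the third field, which
is all the loop below relies on.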

    for dom in $(
        # Extract domains from list, get rid of any numeric IPs which
        # have snuck through.
        awk '{print $3}' /tmp/spamdomains-ranked |
            sed -e '/^[0-9]\{1,3\}\.[0-9]\{1,3\}/d'
        )
    do
        printf '\n>>> %s <<<\n' "$dom"
        # Recursive query.  The first pull asks InterNIC which WHOIS
        # server holds the record (last bracketed line in the header).
        server=$(
            jwhois -h whois.internic.net "$dom" |
                head |
                grep '^\[' |
                tail -1 |
                sed -e 's/[][]//g'
        )
        # Query the second time, using the server indicated by the
        # first pull; fall back to InterNIC if it named none.
        jwhois -h "${server:-whois.internic.net}" "$dom" |
            head -2
    done


...that's a serial query, which can bog down on timeouts for any given
domain.  To speed processing, batch requests, e.g.:

    step=40 # Number of requests to batch in simultaneous submits
    for s in $(
        seq 1 $step $( wc -l < /tmp/spamdomains-ranked )
        )
    do
        e=$(( s + step - 1 ))
        echo "e: $e"
        for dom in $(
            awk '{print $3}' /tmp/spamdomains-ranked |
                sed -e '/^[0-9]\{1,3\}\.[0-9]\{1,3\}/d' |
                sed -ne "${s},${e}p"
        )
        do
            # Background each lookup so the whole batch runs at once.
            # Output from concurrent lookups may interleave.
            {
                printf '\n>>> %s <<<\n' "$dom"
                server=$(
                    jwhois -h whois.internic.net "$dom" |
                        head |
                        grep '^\[' |
                        tail -1 |
                        sed -e 's/[][]//g'
                )
                jwhois -h "${server:-whois.internic.net}" "$dom" | head -2
            } &
        done
        wait    # Let this batch finish before starting the next.
    done

Alternatively, sleep for 5-20 seconds between batches rather than
'wait'ing.
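
A rough sketch of the sleep variant, pulled out into a generic helper.
The name `run_batched` and its argument order are invented here, not
anything jwhois provides: it reads items on stdin, launches the given
command on each in the background, and pauses after every batch.

```shell
# Hypothetical helper.
# Usage: run_batched BATCH_SIZE PAUSE_SECONDS COMMAND [ARGS...]
# Runs "COMMAND ARGS... item" in the background for each stdin line.
# After every BATCH_SIZE items it sleeps PAUSE_SECONDS, or waits for
# the batch to finish if PAUSE_SECONDS is 0.
# The body is a subshell, so the final wait only sees its own jobs.
run_batched() (
    batch=$1 pause=$2
    shift 2
    n=0
    while IFS= read -r item
    do
        # Detach stdin so the job cannot eat the rest of the list.
        "$@" "$item" </dev/null &
        n=$(( n + 1 ))
        if [ $(( n % batch )) -eq 0 ]
        then
            if [ "$pause" -gt 0 ]
            then
                sleep "$pause"
            else
                wait
            fi
        fi
    done
    wait    # collect whatever is still running
)
```

For example, with the loop body wrapped in a function of your own (call
it `seed_one`), something like
`awk '{print $3}' /tmp/spamdomains-ranked | run_batched 40 10 seed_one`
would submit 40 lookups, sleep 10 seconds, and repeat.  Sleeping keeps
one slow server from stalling a batch, at the cost of no hard cap on how
many lookups are in flight.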


What I don't have is a way to periodically repeat this seeding, which
would be useful, though using the recursive lookup in scripts could
satisfy most needs.
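
For script use, the recursive lookup could be wrapped in a function along
these lines.  This is only a sketch: `whois_recursive` is a made-up name,
and it assumes the same thing the loops above do, namely that jwhois
prints the responsible server as a bracketed line within the first ten
lines of output.

```shell
# Hypothetical wrapper around the two-step lookup.
# Usage: whois_recursive DOMAIN
whois_recursive() {
    # First pass: ask InterNIC and keep the last bracketed server line
    # from the top of the output, with the brackets stripped.
    server=$(
        jwhois -h whois.internic.net "$1" |
            head |
            grep '^\[' |
            tail -1 |
            sed -e 's/[][]//g'
    )
    # Second pass: query that server, or InterNIC again if the first
    # pass named none.
    jwhois -h "${server:-whois.internic.net}" "$1"
}
```

A script can then call `whois_recursive example.com` wherever it would
have called jwhois directly, and each call refreshes the cache entry as a
side effect.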


Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    A: Because it messes up the order in which people normally read text.
    Q: Why is top-posting such a bad thing?
    A: Top-posting.
    Q: What is the most annoying thing on usenet and in e-mail?
