
How to improve on this Python code. Was: Re: OT: Why is C so popular?



Ron Johnson wrote:

> But, of course, that's not an issue in The Clearly Superior Language,
> is it?

Ok, if this thread has accomplished little else, it seems to have gotten a couple of people, including myself, to play around with Python.

I have a simple little Perl program at work. It parses a mailbox-format file and extracts email addresses from it. The mailbox file is a bunch of bounced email notifications. The program uses a couple of regular expressions to extract the data and a hash to hold a unique list of the addresses. It then connects to a database to see if those people have already been flagged as having bad email addresses.

I made a copy of this script, removed the database stuff, and stripped it down to the basic process: read the file, find email addresses (excluding some admin-type addresses), and print out the unique bounced addresses. Then I wrote a Python version. It didn't take very long to code, about the same time as the Perl version if I leave out the time spent reading howtos and the language reference.

$ wc bademail.mai
  8414   29296  419796 bademail.mai

Perl: parseContacts.pl
----------------------------------------------------------
my ($unique_emails,$total_emails);
$unique_emails = 0;
$total_emails = 0;

my %badHash;

my $filename = 'bademail.mai';
open(BADEML,$filename) or die "Error opening $filename";
while($isline=<BADEML>)
{

       if($isline =~ /([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)/) {
               $eml = lc($1);
               next if($eml =~ /(postmaster|mailer-daemon)/i);
               $badHash{$eml} = 'bad';
               $total_emails++;
       }

}

close(BADEML);

foreach $bademail (sort keys(%badHash))
{
               print "$bademail, $badHash{$bademail}\n";
               $unique_emails++;
}

print "$filename: total bounced emails ($total_emails); unique ($unique_emails)\n";
--------------------------------
$ time perl parseContacts.pl
[...snip print email statements...]
bademail.mai: total bounced emails (888); unique (217)

real    0m0.056s
user    0m0.051s
sys     0m0.004s

Python: parseContacts.py
-------------------------------
import sys, re, string

unique_emails = 0
total_emails = 0
badHash = {}
file = "bademail.mai"
try:
       fi = open(file,"r")
       print "Tried to open "+file
except (IOError, OSError):
       print "Trouble opening "+file
       sys.exit(0)


email_pattern = re.compile("([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)")
ignore_pattern = re.compile("(postmaster|mailer-daemon)",re.I)

for line in fi.readlines():
       email_mo = email_pattern.search(line)
       if email_mo:
               ignore_mo = ignore_pattern.search(email_mo.group(1))
               if not ignore_mo:
                       badHash[string.lower(email_mo.group(1))] = "bad"
                       total_emails = total_emails + 1

fi.close();

for badmail in badHash.keys():
       print "%s, %s" % (badmail, badHash[badmail])
       unique_emails = unique_emails + 1

print "%s: total bounced emails (%d); unique (%d)"% \
       (file, total_emails, unique_emails)

--------------------------------------
$ time python parseContacts.py
[...snip print email statements...]
bademail.mai: total bounced emails (888); unique (217)

real    0m0.839s
user    0m0.818s
sys     0m0.008s

I was able to write both versions and they find the same emails. I guess I'll have to look at both scripts in six months to see which one I can still understand. I purposely left out all the commenting I would normally do to see how well the code stood on its own (though that is _very_ much against my normal practice).

What did I do wrong to make the python code take over ten times as long?

I thought the compiled regexp patterns were supposed to be an advantage over non-compiled ones (although I believe Perl caches the compiled regexp if its contents don't change).
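
For comparison, here is a minimal sketch of the two ways to apply the address pattern, assuming a current Python 3 interpreter rather than the versions timed here. The re module documents that it caches recently compiled patterns, so the module-level call does not recompile on every line either. The names PATTERN, COMPILED, find_precompiled and find_uncompiled are made up for illustration.

```python
import re

# Same address pattern as in the scripts above, written as a raw string.
PATTERN = r"([a-zA-Z_.\-]+@[a-zA-Z.\-]+)"

# Compiled once, up front, as parseContacts.py does.
COMPILED = re.compile(PATTERN)


def find_precompiled(line):
    """Search with the pattern object that was compiled once."""
    mo = COMPILED.search(line)
    return mo.group(1) if mo else None


def find_uncompiled(line):
    """Search via the module-level function; re caches the compiled form."""
    mo = re.search(PATTERN, line)
    return mo.group(1) if mo else None
```

Both helpers return the same address for the same line; any speed difference between them would have to be measured, not assumed.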

Is string.lower() slower than Perl's lc()?

Am I using an inferior method of reading a line of text from the file? (I'm instructing Perl to read one line at a time as well, and not gobble the whole thing in one go.)
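
On the reading question: fi.readlines() builds a list of every line before the loop starts, whereas Perl's while(<BADEML>) fetches one line per iteration. Below is a sketch of a line-at-a-time equivalent, assuming a current Python 3 interpreter, where a file object can be iterated directly. extract_bounced is a hypothetical helper name, and whether this change accounts for any of the timing gap is untested.

```python
import re

EMAIL = re.compile(r"([a-zA-Z_.\-]+@[a-zA-Z.\-]+)")
IGNORE = re.compile(r"postmaster|mailer-daemon", re.I)


def extract_bounced(lines):
    """Return (total matches, sorted unique addresses) for an iterable of lines."""
    total = 0
    unique = set()              # plays the role of badHash
    for line in lines:          # one line per iteration, no readlines()
        mo = EMAIL.search(line)
        if mo is None:
            continue
        addr = mo.group(1).lower()
        if IGNORE.search(addr):
            continue            # skip admin-type addresses
        unique.add(addr)
        total += 1
    return total, sorted(unique)

# With the real file this would be called as:
#     with open("bademail.mai") as fi:
#         total, addrs = extract_bounced(fi)
```

Taking any iterable of lines, not a filename, keeps the parsing separate from the file handling, so it can be tried on a short list of strings first.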

If I cut the file down a bit the results are still over 10x apart:
$ wc bademail.mai
  4725   11919  282057 bademail.mai
Perl:
bademail.mai: total bounced emails (313); unique (68)

real    0m0.030s
user    0m0.025s
sys     0m0.006s
Python:
bademail.mai: total bounced emails (313); unique (68)

real    0m0.532s
user    0m0.521s
sys     0m0.010s

$ wc bademail.mai
    58     282    2099 bademail.mai
Perl:
bademail.mai: total bounced emails (6); unique (2)

real    0m0.009s
user    0m0.004s
sys     0m0.006s
Python:
bademail.mai: total bounced emails (6); unique (2)

real    0m0.044s
user    0m0.035s
sys     0m0.008s


I thought all the talk about speeding up Python with those other programs was about getting it on par with compiled code like C. Do I need them just to get my Python code running as fast as my Perl code, or am I doing something grossly wrong? I tried python, python1.5, python2 and python2.2, and they gave practically the same results.
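
One way to tell interpreter start-up apart from the parsing itself is to time only the loop from inside the script. This is a rough sketch, again assuming Python 3 (time.perf_counter is not available in the versions tried above). time_call and count_at_signs are hypothetical names, and the workload is only a stand-in for the real parsing loop.

```python
import time


def time_call(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def count_at_signs(lines):
    """Stand-in workload: count the lines that contain an '@'."""
    return sum(1 for line in lines if "@" in line)


result, elapsed = time_call(count_at_signs, ["a@b\n", "none\n", "c@d\n"])
print("matched %d lines in %.6f s" % (result, elapsed))
```

Comparing that figure with the "real" time reported by the shell would show how much of the run is start-up and import cost, which may matter most for the 58-line file.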




