How to improve on this Python code. Was: Re: OT: Why is C so popular?
Ron Johnson wrote:
But, of course, that's not an issue in The Clearly Superior Language,
is it?
Ok, if this thread has accomplished little else, it seems to have gotten
a couple of people, including myself, to play around with Python.
I have a simple little Perl program at work. It parses a mailbox-format
file and extracts email addresses from it. The mailbox file is a bunch
of bounced email notifications. The program uses a couple of regular
expressions to extract the data and a hash to hold a unique list of the
emails. It then connects to a database to see if those people have
already been flagged as having bad emails.
I made a copy of this script, removed the database stuff, and stripped
it down to the basic process: read the file, find emails (excluding
some admin-type addresses), and then print out the unique bounced
addresses. Then I wrote a Python version. It didn't take very long to
code: about the same time as the Perl version, if I leave out the time
spent reading HOWTOs and the language reference.
$ wc bademail.mai
8414 29296 419796 bademail.mai
Perl: parseContacts.pl
----------------------------------------------------------
my ($unique_emails,$total_emails);
$unique_emails = 0;
$total_emails = 0;
my %badHash;
my $filename = 'bademail.mai';
open(BADEML,$filename) or die "Error opening $filename";
while($isline=<BADEML>)
{
    if($isline =~ /([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)/) {
        $eml = lc($1);
        next if($eml =~ /(postmaster|mailer-daemon)/i);
        $badHash{$eml} = 'bad';
        $total_emails++;
    }
}
close(BADEML);
foreach $bademail (sort keys(%badHash))
{
    print "$bademail, $badHash{$bademail}\n";
    $unique_emails++;
}
print "$filename: total bounced emails ($total_emails); unique ($unique_emails)\n";
--------------------------------
$ time perl parseContacts.pl
[...snip print email statements...]
bademail.mai: total bounced emails (888); unique (217)
real 0m0.056s
user 0m0.051s
sys 0m0.004s
Python: parseContacts.py
-------------------------------
import sys, re, string
unique_emails = 0
total_emails = 0
badHash = {}
file = "bademail.mai"
try:
    fi = open(file,"r")
    print "Tried to open "+file
except (IOError, OSError):
    print "Trouble opening "+file
    sys.exit(0)
email_pattern = re.compile("([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)")
ignore_pattern = re.compile("(postmaster|mailer-daemon)",re.I)
for line in fi.readlines():
    email_mo = email_pattern.search(line)
    if email_mo:
        ignore_mo = ignore_pattern.search(email_mo.group(1))
        if not ignore_mo:
            badHash[string.lower(email_mo.group(1))] = "bad"
            total_emails = total_emails + 1
fi.close();
for badmail in badHash.keys():
    print "%s, %s" % (badmail, badHash[badmail])
    unique_emails = unique_emails + 1
print "%s: total bounced emails (%d); unique (%d)"% \
    (file, total_emails, unique_emails)
--------------------------------------
$ time python parseContacts.py
[...snip print email statements...]
bademail.mai: total bounced emails (888); unique (217)
real 0m0.839s
user 0m0.818s
sys 0m0.008s
I was able to write both versions and they find the same emails. I guess
I'll have to look at both scripts in six months to see which one I can
still understand. I purposely left out all the commenting I would normally
do to see how well the code stood on its own (though that is _very_
much against my normal practice).
What did I do wrong to make the Python code take over ten times as long?
I thought the compiled regexp patterns were supposed to be an advantage
over non-compiled ones (although I believe Perl caches the compiled
regexp if its contents don't change).
Is string.lower() slower than Perl's lc()?
Am I using an inferior method of reading a line of text from the file?
(I'm instructing Perl to read one line at a time as well, and not gobble
the whole thing in one go.)
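
For comparison, here is a minimal sketch of how the same loop might be
written in present-day Python 3. It is not the Python 1.5/2.x script timed
above, and it makes no claim about where the time goes in that script. It
only shows the three things the questions touch on: iterating over lines
lazily instead of calling readlines(), lowercasing each address once with
the str.lower() method, and keeping the unique addresses in a set. The name
collect_bounced is made up for this illustration.

```python
import re

# Same two patterns as in the scripts above, written as raw strings.
EMAIL_PATTERN = re.compile(r"([a-zA-Z_.\-]+@[a-zA-Z.\-]+)")
IGNORE_PATTERN = re.compile(r"(postmaster|mailer-daemon)", re.I)


def collect_bounced(lines):
    """Return (total, unique_addresses) for an iterable of text lines.

    `lines` can be an open file object, which yields one line at a time,
    or any other iterable of strings.
    """
    unique = set()
    total = 0
    for line in lines:
        match = EMAIL_PATTERN.search(line)
        if match is None:
            continue
        # Lowercase once and reuse the result for both the filter and the set.
        address = match.group(1).lower()
        if IGNORE_PATTERN.search(address):
            continue
        unique.add(address)
        total += 1
    return total, unique
```

Passing an open file, as in "with open('bademail.mai') as fi: total, unique
= collect_bounced(fi)", reads one line at a time, which is the closest match
to the Perl while loop.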
If I cut the file down a bit the results are still over 10x apart:
$ wc bademail.mai
4725 11919 282057 bademail.mai
Perl:
bademail.mai: total bounced emails (313); unique (68)
real 0m0.030s
user 0m0.025s
sys 0m0.006s
Python:
bademail.mai: total bounced emails (313); unique (68)
real 0m0.532s
user 0m0.521s
sys 0m0.010s
$ wc bademail.mai
58 282 2099 bademail.mai
Perl:
bademail.mai: total bounced emails (6); unique (2)
real 0m0.009s
user 0m0.004s
sys 0m0.006s
Python:
bademail.mai: total bounced emails (6); unique (2)
real 0m0.044s
user 0m0.035s
sys 0m0.008s
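
On the 58-line file the gap shrinks to roughly five times, which suggests
(but does not prove) that some of the difference may be fixed interpreter
startup cost rather than the loop itself. One rough way to separate the two
is to time only the regexp loop from inside the program. The sketch below
assumes present-day Python 3 and its time.perf_counter() function; the
helper name time_search_loop is hypothetical, and the numbers it returns
are not comparable with the `time` output above.

```python
import re
import time

# Same address pattern as in the scripts above, written as a raw string.
EMAIL_PATTERN = re.compile(r"([a-zA-Z_.\-]+@[a-zA-Z.\-]+)")


def time_search_loop(lines, repeats=5):
    """Return the best wall-clock time, in seconds, for scanning `lines`.

    `lines` must be a list (not a file object) so it can be scanned
    more than once. Taking the minimum of several runs reduces noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            EMAIL_PATTERN.search(line)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best
```

If the value returned for the full file is much smaller than the `real`
figure reported by `time`, the remainder is spent outside the loop: starting
the interpreter, importing modules, reading the file, and printing.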
I thought all that talk about speeding up Python using those other
programs, or whatever they were, was about getting it on par with compiled
code like C. Do I need to use them to get my Python code to run as fast as
my Perl code, or am I doing something grossly incorrect? I tried python,
python1.5, python2 and python2.2, and they gave practically the same results.