How to improve on this Python code. Was: Re: OT: Why is C so popular?

To: Debian-User <debian-user@lists.debian.org>
Subject: How to improve on this Python code. Was: Re: OT: Why is C so popular?
From: Jacob Anawalt <jacob@cachevalley.com>
Date: Sat, 30 Aug 2003 00:21:37 -0600
Message-id: <3F504271.6020301@cachevalley.com>
In-reply-to: <1062192851.667.148.camel@haggis>

Ron Johnson wrote:

> But, of course, that's not an issue in The Clearly Superior Language,
> is it?

OK, if this thread has accomplished little else, it seems to have gotten a couple of people, including myself, to play around with Python.

I have a simple little Perl program at work. It parses a mailbox-format file and extracts email addresses from it. The mailbox file is a bunch of bounced email notifications. The program uses a couple of regular expressions to extract the data and a hash to hold a unique list of the emails. It then connects to a database to see if those people have already been flagged as having bad emails.

I made a copy of this script, removed the database stuff, and stripped it down to the basic process: read the file, find emails excluding some admin-type addresses, and then print out the unique bounced addresses. Then I wrote a Python version. It didn't take very long to code, about the same time as writing the Perl code if I take out the time spent reading howtos and the language reference.


$ wc bademail.mai
  8414   29296  419796 bademail.mai

Perl: parseContacts.pl
----------------------------------------------------------
my ($unique_emails,$total_emails);
$unique_emails = 0;
$total_emails = 0;

my %badHash;

my $filename = 'bademail.mai';
open(BADEML,$filename) or die "Error opening $filename";
while($isline=<BADEML>)
{

       if($isline =~ /([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)/) {
               $eml = lc($1);
               next if($eml =~ /(postmaster|mailer-daemon)/i);
               $badHash{$eml} = 'bad';
               $total_emails++;
       }

}

close(BADEML);

foreach $bademail (sort keys(%badHash))
{
               print "$bademail, $badHash{$bademail}\n";
               $unique_emails++;
}

print "$filename: total bounced emails ($total_emails); unique($unique_emails)\n";
--------------------------------
$ time perl parseContacts.pl
[...snip print email statements...]
bademail.mai: total bounced emails (888); unique (217)

real    0m0.056s
user    0m0.051s
sys     0m0.004s

Python: parseContacts.py
-------------------------------
import sys, re, string

unique_emails = 0
total_emails = 0
badHash = {}
file = "bademail.mai"
try:
       fi = open(file,"r")
       print "Tried to open "+file
except (IOError, OSError):
       print "Trouble opening "+file
       sys.exit(0)


email_pattern = re.compile("([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)")
ignore_pattern = re.compile("(postmaster|mailer-daemon)",re.I)

for line in fi.readlines():
       email_mo = email_pattern.search(line)
       if email_mo:
               ignore_mo = ignore_pattern.search(email_mo.group(1))
               if not ignore_mo:
                       badHash[string.lower(email_mo.group(1))] = "bad"
                       total_emails = total_emails + 1

fi.close();

for badmail in badHash.keys():
       print "%s, %s" % (badmail, badHash[badmail])
       unique_emails = unique_emails + 1

print "%s: total bounced emails (%d); unique (%d)"% \
       (file, total_emails, unique_emails)

--------------------------------------
$ time python parseContacts.py
[...snip print email statements...]
bademail.mai: total bounced emails (888); unique (217)

real    0m0.839s
user    0m0.818s
sys     0m0.008s

I was able to write both versions, and they find the same emails. I guess I'll have to look at both scripts in six months to see which one I can still understand. I purposely left out all the commenting I would normally do to see how well the code stood on its own (but that is _very_ against my normal practice).


What did I do wrong to make the python code take over ten times as long?

I thought the compiled regexp patterns were supposed to be an advantage over non-compiled ones (although I believe Perl caches the compiled regexp if its contents don't change).
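One hedged way to check this assumption is to time the two styles directly. The sketch below uses the standard timeit module with the same address pattern as the scripts above. The sample line and the helper name time_search are made up for illustration, and the numbers will vary by machine and Python version.

```python
import re
import timeit

# Same address pattern as in the scripts above.
PATTERN = r"([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)"
COMPILED = re.compile(PATTERN)

# Hypothetical sample line, not taken from the real mailbox file.
SAMPLE = "Final-Recipient: rfc822; someone@example.com"


def time_search(number=10000):
    """Return (compiled_seconds, module_level_seconds) for `number` searches."""
    compiled = timeit.timeit(lambda: COMPILED.search(SAMPLE), number=number)
    # re.search() looks the pattern up in the re module's internal cache
    # on every call instead of reusing a pattern object held by the caller.
    module_level = timeit.timeit(lambda: re.search(PATTERN, SAMPLE), number=number)
    return compiled, module_level


compiled, module_level = time_search()
print("compiled: %.4fs  re.search: %.4fs" % (compiled, module_level))
```

If both figures are small compared with the total run time, the regexp style is probably not where the difference between the two scripts comes from.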


Is string.lower() slower than Perl's lc()?
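For reference, the Python script calls the module-level string.lower() function, but strings also have their own lower() method in Python 2.0 and later (so this form would presumably not run under python1.5). A small sketch, where normalize() is a hypothetical helper name and not part of either script:

```python
def normalize(address):
    """Lower-case an address with the str method instead of string.lower()."""
    return address.lower()


# Made-up address, for illustration only.
print(normalize("Some.User@Example.COM"))
```

Whether the method form is measurably faster than the module function is not something this sketch shows; it would need to be timed on the interpreter in question.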

Am I using an inferior method of reading a line of text from the file? (I'm instructing Perl to read one line at a time as well, and not gobble the whole thing in one go.)
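Note that fi.readlines() in the Python script builds a list of every line before the loop starts, so that script does not actually read one line at a time. A hedged sketch of a line-at-a-time loop follows. It accepts any iterable of lines, and a file object can be iterated directly in Python 2.2 and later. The function name collect_bounced and the sample lines are assumptions made for illustration.

```python
import re

email_pattern = re.compile(r"([a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+)")
ignore_pattern = re.compile(r"(postmaster|mailer-daemon)", re.I)


def collect_bounced(lines):
    """Return (total, unique_dict) for an iterable of mailbox lines."""
    total = 0
    bad = {}
    for line in lines:  # a file object yields one line per iteration
        mo = email_pattern.search(line)
        if mo and not ignore_pattern.search(mo.group(1)):
            bad[mo.group(1).lower()] = "bad"
            total += 1
    return total, bad


# Hypothetical sample lines; with a real file, pass the open file object
# itself, e.g. collect_bounced(open("bademail.mai")).
sample = [
    "To: Some.User@Example.com\n",
    "From: MAILER-DAEMON@example.org\n",
    "no address on this line\n",
    "Cc: some.user@example.com\n",
]
total, bad = collect_bounced(sample)
print("total (%d); unique (%d)" % (total, len(bad)))
```

For a file of a few hundred kilobytes the memory saving is minor, so this change alone may not account for much of the timing gap.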


If I cut the file down a bit the results are still over 10x apart:
$ wc bademail.mai
  4725   11919  282057 bademail.mai
Perl:
bademail.mai: total bounced emails (313); unique (68)

real    0m0.030s
user    0m0.025s
sys     0m0.006s
Python:
bademail.mai: total bounced emails (313); unique (68)

real    0m0.532s
user    0m0.521s
sys     0m0.010s

$ wc bademail.mai
    58     282    2099 bademail.mai
Perl:
bademail.mai: total bounced emails (6); unique (2)

real    0m0.009s
user    0m0.004s
sys     0m0.006s
Python:
bademail.mai: total bounced emails (6); unique (2)

real    0m0.044s
user    0m0.035s
sys     0m0.008s

I thought all that talk about speeding up Python using those other programs, or whatever they were, was to get it on par with compiled code like C. Do I need to use them to get my Python code to run as fast as my Perl code, or am I doing something grossly incorrect? I used python, python1.5, python2 and python2.2, and they gave practically the same results.
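Before reaching for extra tools, it may help to see how much of the wall-clock time is spent in the parsing loop itself, since `time python ...` also counts interpreter startup and module imports. A minimal sketch, assuming a hypothetical helper timed() and a stand-in count_matches() in place of the real loop; the data is made up.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start


def count_matches(lines):
    # Stand-in for the real parsing loop; counts lines containing "@".
    return len([line for line in lines if "@" in line])


lines = ["a@example.com\n", "no address\n"] * 1000  # made-up data
count, elapsed = timed(count_matches, lines)
print("matched %d lines in %.6f seconds" % (count, elapsed))
```

Wrapping the real loop the same way and comparing the elapsed figure with the `real` time reported by the shell would show how much of the gap is fixed startup cost and how much is the loop.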
