Re: How to improve on this Python code. Was: Re: OT: Why is C so popular?

To: Debian-User <debian-user@lists.debian.org>
Subject: Re: How to improve on this Python code. Was: Re: OT: Why is C so popular?
From: Jacob Anawalt <jacob@cachevalley.com>
Date: Mon, 01 Sep 2003 15:24:40 -0600
Message-id: <3F53B918.2020902@cachevalley.com>
In-reply-to: <1062357048.29282.2971.camel@flmrroach>
References: <1061964668.3339.76.camel@Thief> <20030827112105.GC22538@ursine.ca> <20030827110636.0272187a.grey@dmiyu.org> <1062018108.9384.187.camel@Thief> <20030827155929.3102dd81.grey@dmiyu.org> <20030828030303.GA29386@turing.cs.camosun.bc.ca> <1062041710.25752.90.camel@haggis> <3F4D98A3.9050507@cachevalley.com> <20030827235523.4b128cfc.grey@dmiyu.org> <1062062293.11099.140.camel@Thief> <20030828024105.368471c3.grey@dmiyu.org> <1062064865.11718.5.camel@Thief> <1062076955.29296.1367.camel@flmrroach> <1062081325.3f4e132d121c1@impt3-1.free.fr> <20030828125036.223c28ec.grey@dmiyu.org> <1062125648.17121.4.camel@Thief> <1062163150.668.25.camel@haggis> <20030829063008.6cfb90ee.grey@dmiyu.org> <1062179254.667.91.camel@haggis> <20030829123851.07fdb0dd.grey@dmiyu.org> <1062192851.667.148.camel@haggis> <3F504271.6020301@cachevalley.com> <1062357048.29282.2971.camel@flmrroach>

Mark Roach wrote:

On Sat, 2003-08-30 at 02:21, Jacob Anawalt wrote:
Ron Johnson wrote:
But, of course, that's not an issue in The Clearly Superior Language,
is it?
Ok, if this thread has accomplished little else, it seems to have gotten a couple people, including myself, to play around with Python.
I have a simple little perl program at work. It parses a mailbox format file and extracts email addresses from it. The mailbox file is a bunch of bounced email notifications. The program uses a couple regular expressions to extract the data and a hash to hold a unique list of the emails. It then connects to a database to see if those people have already been flagged as having bad emails.
I made a copy of this script, removed the database stuff, and stripped it down to the basic process of read the file, find emails excluding some admin type addresses and then print out the unique bounced addresses. Then I wrote a Python version. It didn't take very long to code, about the same time to write the perl code if I take out the time reading howto's and the language reference.
[snip]
real    0m0.056s
user    0m0.051s
sys     0m0.004s
[snip]
real    0m0.839s
user    0m0.818s
sys     0m0.008s
[snip]
What did I do wrong to make the python code take over ten times as long?
Sure, use perl's #1 optimized, built-in feature for your test case ;-)

I didn't realize I was until I got some emails about it :)

I wasn't trying to do a performance test, just a readability test. I liked that they were about the same length, and I'll see how well I understand them when I look at them in January. I was surprised by the difference in speed doing what I do all the time with my Perl code, so I wanted to ask about what I could have done better. I think that question has been answered here and in a few off-list emails.


I think, that in this case it is probably safe to say "who cares?" do
you really care about .06 seconds vs .8 seconds? is your real data large
enough for the difference to matter?. (if it is, btw, you might want to
use a while loop and 'line = fi.readline()' instead of putting the whole
file in memory)

The code "for line in fi.readlines()" reads the whole file into a tuple or something instead of iterating lines? Well that was a big whoops.

Sometimes I have 100MB mail files, so 10x as long is a difference (and I don't want to slurp the thing into memory - thanks for pointing out that I was).
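
For the record, a minimal sketch of the difference, assuming a reasonably recent Python where a file object can be iterated directly. The name count_lines and the in-memory sample are made up for illustration; readlines() builds a list of every line up front, while iterating the file object hands back one line at a time.

```python
import io

def count_lines(fi):
    """Count lines by iterating the file object, one line at a time."""
    count = 0
    for line in fi:  # lazy: only the current line is held in memory
        count += 1
    return count

# Hypothetical stand-in for an open mailbox file.
sample = io.StringIO("first line\nsecond line\nthird line\n")

# readlines() returns a list containing every line of the file at once.
all_lines = sample.readlines()
print(type(all_lines).__name__, len(all_lines))

# Rewind and stream instead of slurping.
sample.seek(0)
print(count_lines(sample))
```

The same loop works on a file opened with open(); on a 100MB mailbox the streaming form avoids holding the whole file in memory.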

On the other hand I don't process these every day and can do something else while it's running. In any case, I'm trying to find where I might use Python from day to day. I'll try CGI next, especially some longer and more complex work where I normally write a few Perl modules. I don't do GUI work on Linux very often, so I probably won't dabble in that.


No one has said (that I know of) that python is the fastest language
ever. The only thing that I have heard is that it is "fast enough" and
that the benefits outweigh the addition (.8 seconds of) run time. In
some cases it might not, you might not like it, you might have code that
already works just fine, right tool for the right job and all that.

in terms of a very simple optimization though, try adding this just
before the "email_mo = email_pattern.search(line)" line:

if not '@' in line:
   continue

this prevents every line from having regex used when there is no chance
of it matching. It also cut the runtime for my test case (a 1.3 meg
mailbox) from .380 to .117

So my mistake was using a regexp instead of find to ignore lines that didn't have email addresses in them. Someone else also pointed out using find instead of the regex to limit the matches.
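
A rough sketch of how that check fits into the loop, as I understand it. The two patterns are assumptions: email_pattern is borrowed from the egrep expression quoted further down, and ignore_pattern is a made-up stand-in for the admin-type addresses my script skips. The function name collect_bad_addresses is also just for illustration.

```python
import re

# Assumed patterns; the real script's definitions are not shown in this thread.
email_pattern = re.compile(r"[a-zA-Z_.\-]+@[a-zA-Z.\-]+")
ignore_pattern = re.compile(r"postmaster|mailer-daemon", re.IGNORECASE)

def collect_bad_addresses(lines):
    """Return a dict keyed by the unique lowercased bounced addresses."""
    bad_hash = {}
    for line in lines:
        # Cheap substring test first: skip the regex when no '@' is present.
        if '@' not in line:
            continue
        email_mo = email_pattern.search(line)
        if email_mo and not ignore_pattern.search(email_mo.group()):
            bad_hash[email_mo.group().lower()] = 'bad'
    return bad_hash
```

Passing an open file object as lines keeps the line-at-a-time behaviour, and the substring test means the regex only runs on lines that could possibly match.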


or if you want to really cheat, replace your file open and main for loop
with this:
"""
lines = os.popen(r"egrep '[a-zA-Z_\.\-]+\@[a-zA-Z\.\-]+' bademail.mai").read()

matches = email_pattern.findall(lines)
for match in matches:
   if not ignore_pattern.search(match):
       badHash[match.lower()] = 'bad'
"""
yes, it _is_ evil, but it works ;-)

real    0m0.069s
user    0m0.060s
sys     0m0.010s


-Mark

Well, that's a way to limit the lines matched :) I'm not sure what the rest of the code does exactly, but if it turns:


The email address Jacob <jacob@CacheValley.com> was not valid.

Into:

"jacob@cachevalley.com" stored in match, then it's a solution I'll have to keep in mind if I decide to try Python for another file search and database script.

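A quick way to check that on the sample line, as a sketch only. The pattern is again assumed from the egrep expression above rather than taken from the real script, and extract_addresses is a made-up helper name. findall returns every match in the text as a list of strings, and lower() is what folds the case.

```python
import re

# Assumed pattern, copied in spirit from the egrep expression in Mark's example.
email_pattern = re.compile(r"[a-zA-Z_.\-]+@[a-zA-Z.\-]+")

def extract_addresses(text):
    """Return every address found in text, lowercased, in order of appearance."""
    return [match.lower() for match in email_pattern.findall(text)]

text = "The email address Jacob <jacob@CacheValley.com> was not valid.\n"
print(extract_addresses(text))  # prints ['jacob@cachevalley.com']
```
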

Thanks,
Jacob
