
Re: quick scripting question - finding occurrence in many lines



On Sun, Nov 05, 2006 at 05:21:23PM +1100, John O'Hagan wrote:
> On Sunday 05 November 2006 16:42, John O'Hagan wrote:
> > On Sunday 05 November 2006 09:03, Ken Irving wrote:
> > > On Fri, Nov 03, 2006 at 09:56:12PM -0500, Douglas Tutty wrote:
> > > > On Fri, Nov 03, 2006 at 08:27:42PM +0000, michael wrote:
> > [...]
> > > > > eg for
> > > > > junk info 18 Pro
> > > > > cessor
> > > > > I wish to get the field '18'
> [...]
> >
> > Here's a version of Douglas' python script that I got to run:
> >
> > --------------------------------------------
> >
> > #!/usr/bin/python
> >
> > IN = open('IN')
> > instring = IN.read()
> >
> > onelinestring = instring.replace('\n', ' ')
> >
> > inlist = onelinestring.split()
> >
> > oldword = ' '
> >
> > for newword in inlist:
> >
> > 	if newword == 'Processor':
> > 		print oldword
> > 	oldword = newword
> > -------------------------------------------------------------
> >
> 
> Or, now that I've seen Ken's contribution:
> 
> -------------------------------------------
> 
> #!/usr/bin/python 
> 
> for newword in open('IN').read().replace('\n', '').split():
>  
> 	if newword == 'Processor':
> 		print oldword	
> 	oldword = newword
> 
> --------------------------------------------
> 
> Either way, I like Douglas' approach of removing the newlines - or perhaps 
> these loops are inefficient?
> 

After thinking about it, yes, it can all go on one line.  It's more
elegant and doesn't use up memory on intermediate variables, but it's
harder to read and to see what it's doing.  It's also harder for someone
who doesn't know Python to follow (is this pseudocode?).

I also found the bug where I was replacing \n with ' ' instead of '',
which left 'Pro' and 'cessor' as two separate words.  I was looking for
a .remove method and couldn't find one.

As far as inefficiency goes, there's some information we don't have that
we need in order to optimize this.

	How many times will this run?

	How big are the input files?

I also don't know how the Python internals deal with the one-line
approach.  If it internally uses temporary storage areas the same way I
used intermediate variables, then there's no performance difference
between the one-liner and doing it one step at a time.  If, on the other
hand, it's smart enough to process the input file on a
character-by-character basis in a single pass, then a one-liner makes
more performance sense.
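
As far as I can tell, each method call in the one-liner returns a
complete new object before the next call starts, so the chained form
builds the same temporaries as the step-by-step one.  If a true single
pass were wanted, it would have to be written out by hand.  Below is a
rough sketch of that idea, not a tested replacement for the scripts
above.  It assumes Python 3 (the scripts in this thread use the older
print statement), and the names stream_words and preceding_words are
made up for illustration.

```python
import io


def stream_words(lines):
    """Yield words one line at a time, as if the newlines had been deleted.

    A fragment left at the end of a line (e.g. 'Pro') is carried over and
    glued to the start of the next line (e.g. 'cessor').
    """
    carry = ""
    for line in lines:
        chunk = carry + line.rstrip("\n")
        parts = chunk.split()
        if parts and not chunk[-1].isspace():
            # The last word may continue on the next line: hold it back.
            carry = parts.pop()
        else:
            carry = ""
        yield from parts
    if carry:
        yield carry


def preceding_words(lines, target="Processor"):
    """Yield the word that comes just before each occurrence of target."""
    previous = None
    for word in stream_words(lines):
        if word == target and previous is not None:
            yield previous
        previous = word


# io.StringIO stands in for open('IN') so the sketch runs on its own.
sample = io.StringIO("junk info 18 Pro\ncessor\n")
for field in preceding_words(sample):
    print(field)
```

Only one line plus one carried-over fragment is held at a time, which
would matter only if the input files turn out to be large.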

The reason for the looping is that, while Python can give you the index
of the first instance of a string ('Processor'), it can't give you a
list of the indices of all instances of a string.  If this were needed
frequently and proved to be a bottleneck, then a function could be
written to do this.  However, you'd then still need to iterate through
that list, printing out the item at index-1 to get the word prior to
'Processor'.
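
Such a function might look something like the sketch below.  It assumes
Python 3, and the names indices_of and words_before are made up for
illustration; this is just one way to do it.

```python
def indices_of(words, target):
    """Return every position at which target occurs in the list words."""
    return [i for i, word in enumerate(words) if word == target]


def words_before(text, target="Processor"):
    """Return the word preceding each occurrence of target in text.

    Newlines are deleted (not replaced by spaces) so that a word broken
    across two lines, such as 'Pro' + 'cessor', is rejoined first.
    """
    words = text.replace("\n", "").split()
    # A match at index 0 has no preceding word, so skip it.
    return [words[i - 1] for i in indices_of(words, target) if i > 0]


sample = "junk info 18 Pro\ncessor\n"
print(words_before(sample))
```

Note that deleting every newline also glues the last word of a line to
the first word of the next, so this only suits input where lines break
in mid-word, as in the example from the original question.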

In any event, at least with Python I can see what I'm trying to do.  If
you really like regex, there's a module for Python (re).
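
For comparison, a regex version could be sketched roughly as follows.
It assumes Python 3 and the standard re module; the function name
fields_before is made up for illustration.

```python
import re


def fields_before(text, target="Processor"):
    """Return the field preceding each occurrence of target in text."""
    # Delete newlines first so a word split across lines is rejoined.
    joined = text.replace("\n", "")
    # Capture one run of non-whitespace followed by whitespace and target.
    pattern = r"(\S+)\s+" + re.escape(target) + r"\b"
    return re.findall(pattern, joined)


sample = "junk info 18 Pro\ncessor\n"
print(fields_before(sample))
```

re.findall returns all non-overlapping matches, so no explicit loop is
needed here.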

Doug.


