
Re: quick scripting question - finding occurrence in many lines

On Wednesday 08 November 2006 03:08, Andrew Sackville-West wrote:
> On Wed, Nov 08, 2006 at 02:51:20AM +1100, John O'Hagan wrote:
> > I tried this, and found that replacing the newlines with spaces stops the
> > grep from working because it puts spaces in the middle of any occurrences
> > of "Processor", but I see what you mean about the edge case. I think this
> > version takes care of it, plus it is hyphen-agnostic:
> >
> >
> > tr  -d '\n'  <IN | sed s/P-*r-*o-*c-*e-*s-*s-*o-*r/' Processor'/g |
> > tr -s ' ' '\n' | grep -B1 'Processor' |  grep -v 'Processor\|--'
> >
> > removing newlines, replacing all cases of (non-)hyphenated "Processor"
> > with a space followed by "Processor", then doing the grep. And here's a
> > Python version using the re module to deal with the hyphens ( the edge
> > case takes care of itself here):
> >
> > import re
> >
> > for i in re.split('P-?r-?o-?c-?e-?s-?s-?o-?r',
> >         open('IN').read().replace('\n', ''))[0:-1]:
> > 		print i.split()[-1]
> huh, I'm not sure. I played with it a little and here's another
> problem
> here is some testing
> data processor
> will return 'testingdata' because the newlines get stripped out
> leaving no space between the words. so..
> first, replace all '-\n' with '' so we dehyphenate any hyphenated
> words split by a newline. there will be some words that should be
> hyphenated but lose that hyphen, however, I think that's probably a
> pretty rare case and it ignores any mid-line hyphenated words. also
> makes it easier to grep as we can ignore the hyphens in processor. next
> replace all '\n' with ' ' so that we avoid the above problem. then
> replace any single-or-more occurrence of ' ' with '\n' to split the
> words into separate lines and finally grep away.
> tr -d '-\n' <IN | tr '\n' ' ' | tr -s ' ' '\n' | grep -B1 'Processor'
> | grep -v 'Processor\|--'


Aha! You're right, my lines fail on the edge cases, and also when the target 
word is hyphenated.

Your ingenious approach didn't always work either [1]; but it revealed (to me) 
that there will be unresolvable ambiguities in the IN file unless:
EITHER: A) lines are broken arbitrarily without hyphenation, in which case 
newlines have no significance, spaces between words must be preserved, and we 
can use:

#tr -d '\n' < IN | tr ' ' '\n' | grep -B1 Processor | grep -v 'Processor\|--'

or in Python:

#for i in  open('IN').read().replace('\n', '').split('Processor')[0:-1]:
#	print i.split()[-1] 

OR: B) broken words are hyphenated, and unhyphenated newlines are equivalent 
to spaces, in which case we could use something like:

while read i ; do

	if [[ $(echo "$i" | grep \\-\$ ) ]]; then
		i=$( echo "$i" | sed s/-\$//) 
		echo "$i" 
	else echo "$i"' '
	fi

done < IN | tr -d '\n' | tr ' ' '\n' | grep -B1 'Processor' | 
grep -v 'Processor\|--'

This removes hyphens at the end of lines or else adds a space, which converts 
the file to the unhyphenated case above - or in Python it's simpler:

#for i in  open('IN').read().replace('-\n', '').split('Processor')[0:-1]:
#	print i.split()[-1]
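
For anyone on a current Python, here is a rough sketch of both cases as one function. The name words_before and the hyphenated flag are my own invention for illustration, and I have added a guard for a target at the very start of the text, which the one-liners above would trip over:

```python
def words_before(text, target="Processor", hyphenated=False):
    """Return the word preceding each occurrence of `target`.

    hyphenated=False: case A, newlines carry no meaning and are deleted.
    hyphenated=True:  case B, a trailing '-' joins a broken word and any
                      other newline is left in place to act as whitespace.
    """
    if hyphenated:
        text = text.replace("-\n", "")
    else:
        text = text.replace("\n", "")
    found = []
    # The last chunk follows the final occurrence, so it is skipped.
    for chunk in text.split(target)[:-1]:
        words = chunk.split()
        if words:  # nothing precedes a target at the very start
            found.append(words[-1])
    return found


if __name__ == "__main__":
    # Made-up sample inputs, one per convention.
    case_a = "a fast Proc\nessor and a slow \nProcessor\n"
    case_b = "a fast Proc-\nessor and a slow\nProcessor\n"
    print(words_before(case_a))
    print(words_before(case_b, hyphenated=True))
```

Both calls should report the same two words, since each sample is the same sentence broken according to a different convention.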

If the IN file does not adhere to A or B, it would be impossible in principle to 
distinguish between a split word, and two words at the end and beginning of 
consecutive lines. In other words, the "edge case" problem and the hyphen 
problem are only solvable separately.
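
To make the ambiguity concrete, here is a small sketch that reads Andrew's test data (with a capital P) under each convention. The helper last_word_before is a hypothetical name of mine, not anything from the thread:

```python
def last_word_before(text, target="Processor"):
    """Last whitespace-separated word before the first `target`, or None."""
    before, sep, _ = text.partition(target)
    if not sep:  # target does not occur at all
        return None
    words = before.split()
    return words[-1] if words else None


sample = "here is some testing\ndata Processor"

# Reading A: the newline is an arbitrary break and is simply deleted.
reading_a = last_word_before(sample.replace("\n", ""))

# Reading B: only '-\n' joins a word; any other newline acts as a space.
reading_b = last_word_before(sample.replace("-\n", "").replace("\n", " "))

print(reading_a)  # testingdata
print(reading_b)  # data
```

The same bytes give two different answers, and nothing in the file itself says which reading is the intended one.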

I'm probably re-inventing the wheel here; but it's very instructive in terms 
of general string parsing - I'm particularly impressed by how easily Python 
adapts to different scenarios.

Pesky hyphens!



[1] I tried Andrew's solution above and found that it only worked reliably on 
the unhyphenated case, I think because tr treats its arguments as character 
sets, not expressions, so that tr -d '\-\n' (note the escape required for the 
hyphen) deletes any hyphens or newlines, not just that combination.
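
The same distinction can be sketched in Python 3, where str.translate deletes each listed character individually (roughly what tr -d does with a set) while str.replace removes only the exact sequence. The sample text is made up for illustration:

```python
text = "multi-\nProcessor is a well-known\nword\n"

# Set semantics: every '-' and every '\n' is deleted on its own.
as_set = text.translate(str.maketrans("", "", "-\n"))

# Sequence semantics: only a '-' immediately followed by '\n' is deleted.
as_sequence = text.replace("-\n", "")

print(repr(as_set))
print(repr(as_sequence))
```

In the first result the mid-line hyphen in "well-known" and the plain newline before "word" are gone as well, which is the over-deletion described above.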
