Re: quick scripting question - finding occurrence in many lines
On Wednesday 08 November 2006 03:08, Andrew Sackville-West wrote:
> On Wed, Nov 08, 2006 at 02:51:20AM +1100, John O'Hagan wrote:
> > I tried this, and found that replacing the newlines with spaces stops the
> > grep from working because it puts spaces in the middle of any occurrences
> > of "Processor", but I see what you mean about the edge case. I think this
> > version takes care of it, plus it is hyphen-agnostic:
> > tr -d '\n' <IN | sed s/P-*r-*o-*c-*e-*s-*s-*o-*r/' Processor'/g |
> > tr -s ' ' '\n' | grep -B1 'Processor' | grep -v 'Processor\|--'
> > removing newlines, replacing all cases of (non-)hyphenated "Processor"
> > with a space followed by "Processor", then doing the grep. And here's a
> > Python version using the re module to deal with the hyphens ( the edge
> > case takes care of itself here):
> > import re
> > for i in re.split('P-?r-?o-?c-?e-?s-?s-?o-?r',
> > print i.split()[-1]
> huh, I'm not sure. I played with it a little and here's another
> here is some testing
> data processor
> will return 'testingdata' because the newlines get stripped out
> leaving no space between the words. so..
> first, replace all '-\n' with '' so we dehyphenate any hyphenated
> words split by a newline. there will be some words that should be
> hyphenated but lose that hyphen, however, I think that's probably a
> pretty rare case and it ignores any mid-line hyphenated words. also
> makes it easier to grep as we can ignore the hyphens in processor next
> replace all '\n' with ' ' so that we avoid the above problem. then
> replace any single-or-more occurrence of ' ' with '\n' to split the
> words into separate lines and finally grep away.
> tr -d '-\n' <IN | tr '\n' ' ' | tr -s ' ' '\n' | grep -B1 'Processor'
> | grep -v 'Processor\|--'
Aha! You're right, my lines fail on the edge cases, and also when the target
word is hyphenated.
Your ingenious approach didn't always work either, but it revealed (to me)
that there will be unresolvable ambiguities in the IN file unless:
EITHER: A) lines are broken arbitrarily without hyphenation, in which case
newlines have no significance, spaces between words must be preserved, and we
can use:
#tr -d '\n' < IN | tr ' ' '\n' | grep -B1 Processor | grep -v 'Processor\|--'
or in Python:
#for i in open('IN').read().replace('\n', '').split('Processor')[0:-1]:
# print i.split()[-1]
OR: B) broken words are hyphenated, and unhyphenated newlines are equivalent
to spaces, in which case we could use something like:
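while read i ; do
    if [[ $(echo "$i" | grep \\-\$ ) ]]; then
        # line ends in a hyphen: print it without the hyphen, adding no space
        echo "$i" | sed s/-\$//
    else
        # ordinary line: the newline stands for a space
        echo "$i"' '
    fi
done < IN | tr -d '\n' | tr ' ' '\n' | grep -B1 'Processor' |
grep -v 'Processor\|--'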
This removes hyphens at the end of lines or else adds a space, which converts
the file to the unhyphenated case above - or in Python it's simpler:
#for i in open('IN').read().replace('-\n', '').split('Processor')[0:-1]:
# print i.split()[-1]
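For anyone reading this with Python 3, here is a minimal sketch that folds cases A and B into one function. The name words_before and the hyphenated flag are made up for illustration; the logic is the same replace/split idea as the one-liners above, plus a guard for an occurrence that has nothing in front of it.

```python
def words_before(text, target="Processor", hyphenated=False):
    """Return the word preceding each occurrence of target.

    hyphenated=False is case A: newlines carry no meaning and are deleted.
    hyphenated=True is case B: a trailing hyphen marks a broken word, and
    any other newline is left alone so that split() treats it as a space.
    """
    joined = text.replace("-\n", "") if hyphenated else text.replace("\n", "")
    found = []
    # The last chunk follows the final occurrence, so it is dropped.
    for chunk in joined.split(target)[:-1]:
        words = chunk.split()
        if words:  # skip an occurrence with nothing in front of it
            found.append(words[-1])
    return found


# Case A: the line break falls inside "Processor", with no hyphen.
print(words_before("a fast Proc\nessor and a slow Processor\n"))
# Case B: the broken word carries a hyphen; the other newline acts as a space.
print(words_before("a fast Proc-\nessor and a slow\nProcessor\n", hyphenated=True))
```

Both calls should pick out "fast" and "slow". Which flag to pass is exactly the A-or-B decision about the IN file; the function cannot make that choice for you.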
If the IN file does not adhere to A or B, it would be impossible in principle to
distinguish between a split word, and two words at the end and beginning of
consecutive lines. In other words, the "edge case" problem and the hyphen
problem are only solvable separately.
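To make the ambiguity concrete, here is a small Python 3 sketch. The sample text is invented for illustration: it has an unhyphenated line break between "over" and "clocked", and the two readings give different answers with nothing in the data to say which is right.

```python
text = "the over\nclocked Processor\n"

# Reading A: newlines are meaningless, so "over" + "clocked" is one word.
reading_a = text.replace("\n", "").split("Processor")[0].split()[-1]
# Reading B: an unhyphenated newline is a space, so they are two words.
reading_b = text.replace("-\n", "").split("Processor")[0].split()[-1]

print(reading_a)  # overclocked
print(reading_b)  # clocked
```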
I'm probably re-inventing the wheel here; but it's very instructive in terms
of general string parsing - I'm particularly impressed by how easily Python
adapts to different scenarios.
I tried Andrew's solution above and found that it worked reliably only on
the unhyphenated case, I think because tr treats its arguments as character
sets, not expressions, so that tr -d '\-\n' (note the escape required for the
hyphen) deletes any hyphens or newlines, not just that combination.
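The difference between deleting a set of characters and deleting a sequence can be mimicked in Python 3, as a rough analogy rather than a reimplementation of tr. The sample text is made up; str.translate stands in for the character-set behaviour and str.replace for the intended "hyphen followed by newline" behaviour.

```python
text = "a well-known multi-\ncore Processor\nis fast\n"

# Like tr -d '\-\n': delete every '-' and every newline, wherever they occur.
as_char_set = text.translate(str.maketrans("", "", "-\n"))
# What was intended: delete only the two-character sequence '-' then newline.
as_sequence = text.replace("-\n", "")

print(repr(as_char_set))  # 'a wellknown multicore Processoris fast'
print(repr(as_sequence))  # 'a well-known multicore Processor\nis fast\n'
```

In the character-set version the mid-line hyphen in "well-known" is lost and "Processor" is fused with the next line's "is", which is why the later stages of the pipeline have nothing left to split on.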