Re: Finding the common textual denominator
On Sunday 06 March 2005 10.56, Joey Hess wrote:
> Ron Johnson wrote:
> > On Sun, 2005-03-06 at 02:16 +0100, Olle Eriksson wrote:
> > > Can anyone help me with how to find the common textual denominator
> > > of an array of strings. I have been searching the web and the man
> > > pages of grep, awk etc to no avail.
> > >
> > > Given the following list of directory names I want to have a script
> > > return "Eric_Clapton".
> > >
> > > Eric_Clapton-Big_Boss_Man-2CD-Retail-2002-DGN/
> > > Eric_Clapton-Higher_Ground-(CDS)-2003-RNS/
> > > Eric_Clapton_-_Me_and_Mr_Johnson-(PROPER)-CD-2004-TN/
> > > Eric_Clapton-One_More_Car_One_More_Rider-2CD-2002-RNS/
> > > Eric_Clapton - Pilgrim/
> >
> > You want a generic algorithm?
>
> Assuming word splitting is ok and you want to avoid O(N^2) methods:
>
> joey@dragon:~>cat foo
> foo by Clapton, Eric
> Eric_Clapton-Big_Boss_Man-2CD-Retail-2002-DGN/
> Eric_Clapton-Higher_Ground-(CDS)-2003-RNS/
> Eric_Clapton_-_Me_and_Mr_Johnson-(PROPER)-CD-2004-TN/
> Eric_Clapton-One_More_Car_One_More_Rider-2CD-2002-RNS/
> Eric_Clapton - Pilgrim/
> joey@dragon:~>perl -e 'while (<>) { my %seen; foreach my $w (split
> /[^a-zA-Z0-9]/) { next unless length $w; $count{$w}++ unless $seen{$w};
> $seen{$w}=1 } }; foreach (keys %count) { $max=$count{$_} if $max <
> $count{$_} }; foreach (keys %count) { print "$_\n" if $count{$_} ==
> $max }' < foo
> Clapton
> Eric
Wow, that little script works, sort of, although not quite the way I would
like it to.
Do I want a generic algorithm? No, not in the sense that it should find
anything the strings have in common. I should have been clearer about
that. It is enough to find only the common beginning of the
strings. So with Joey Hess' example above it would return either nothing,
if it iterates forward starting from the first line, or, even better,
"Eric_Clapton" as one word if it could be that smart. And it should treat
white space as just another character.
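For reference, the strict version of this (the longest beginning shared by
every line, with white space treated like any other character) takes only a
few lines in a scripting language. Below is a minimal sketch in Python rather
than grep or C. The function name common_prefix is made up for this
illustration, and the input is the directory list from the start of the
thread.

```python
def common_prefix(lines):
    """Return the longest prefix shared by every string in lines."""
    prefix = []
    # zip(*lines) yields the first characters of all lines, then the second
    # characters, and so on, stopping at the end of the shortest line.
    for chars in zip(*lines):
        if len(set(chars)) != 1:  # the lines disagree at this position
            break
        prefix.append(chars[0])
    return "".join(prefix)


dirs = [
    "Eric_Clapton-Big_Boss_Man-2CD-Retail-2002-DGN/",
    "Eric_Clapton-Higher_Ground-(CDS)-2003-RNS/",
    "Eric_Clapton_-_Me_and_Mr_Johnson-(PROPER)-CD-2004-TN/",
    "Eric_Clapton-One_More_Car_One_More_Rider-2CD-2002-RNS/",
    "Eric_Clapton - Pilgrim/",
]

print(common_prefix(dirs))  # Eric_Clapton
# One unrelated first line is enough to make the strict result empty.
print(repr(common_prefix(["foo by Clapton, Eric"] + dirs)))  # ''
```

Because it compares character by character, a space or an underscore is
treated no differently from a letter.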
An example:
aa bbbcc
bb bccdd
bb bbcdd
bb bbbdd
This should return either "" or, even better, "bb b".
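One possible reading of the "even better" result is the longest beginning
shared by more than half of the lines, so that a single odd line cannot empty
the answer. That majority rule is an assumption of this sketch, not something
settled in the thread, and the function name majority_prefix is made up. The
sketch is in Python rather than grep or C.

```python
from collections import Counter


def majority_prefix(lines):
    """Longest prefix shared by more than half of the lines ("" if none)."""
    counts = Counter()
    for line in lines:
        # Count every prefix of every line, white space included.
        for end in range(1, len(line) + 1):
            counts[line[:end]] += 1
    needed = len(lines) // 2 + 1  # strict majority (assumed threshold)
    best = ""
    for prefix, n in counts.items():
        if n >= needed and len(prefix) > len(best):
            best = prefix
    return best


example = ["aa bbbcc", "bb bccdd", "bb bbcdd", "bb bbbdd"]
print(repr(majority_prefix(example)))  # 'bb b'
```

Counting every prefix is wasteful for long inputs but keeps the idea visible.
With a strict majority, two different prefixes of the same length can never
both qualify, so the result is unambiguous.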
I guess I could write a small C program to do that, but I figured it could
maybe be done with grep or something. Maybe I am wrong.
Thanks
Olle