
Re: Finding the common textual denominator



On Sunday 06 March 2005 10.56, Joey Hess wrote:
> Ron Johnson wrote:
> > On Sun, 2005-03-06 at 02:16 +0100, Olle Eriksson wrote:
> > > Can anyone help me with how to find the common textual denominator
> > > of an array of strings. I have been searching the web and the man
> > > pages of grep, awk etc to no avail.
> > >
> > > Given the following list of directory names I want to have a script
> > > return "Eric_Clapton".
> > >
> > > Eric_Clapton-Big_Boss_Man-2CD-Retail-2002-DGN/
> > > Eric_Clapton-Higher_Ground-(CDS)-2003-RNS/
> > > Eric_Clapton_-_Me_and_Mr_Johnson-(PROPER)-CD-2004-TN/
> > > Eric_Clapton-One_More_Car_One_More_Rider-2CD-2002-RNS/
> > > Eric_Clapton - Pilgrim/
> >
> > You want a generic algorithm?
>
> Assuming word splitting is ok and you want to avoid O(N^2) methods:
>
> joey@dragon:~>cat foo
> foo by Clapton, Eric
> Eric_Clapton-Big_Boss_Man-2CD-Retail-2002-DGN/
> Eric_Clapton-Higher_Ground-(CDS)-2003-RNS/
> Eric_Clapton_-_Me_and_Mr_Johnson-(PROPER)-CD-2004-TN/
> Eric_Clapton-One_More_Car_One_More_Rider-2CD-2002-RNS/
> Eric_Clapton - Pilgrim/
> joey@dragon:~>perl -e 'while (<>) { my %seen; foreach my $w (split
> /[^a-zA-Z0-9]/) { next unless length $w; $count{$w}++ unless $seen{$w};
> $seen{$w}=1 } }; foreach (keys %count) { $max=$count{$_} if $max <
> $count{$_} }; foreach (keys %count) { print "$_\n" if $count{$_} ==
> $max }' < foo
> Clapton
> Eric

Wow, that little script works, sort of, although not quite the way I
would like it to.

Do I want a generic algorithm? No, not in the sense that it should find
anything the strings have in common. I should have been clearer about
that. It is enough to find the common beginning of the strings. So with
Joey Hess's example above it would return either nothing, if it simply
compares every line starting from the top, or, even better,
"Eric_Clapton" as one word if it could be that smart. And it should treat
white space as just another character.
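
A minimal sketch of the strict reading (the beginning shared by every line, with white space treated like any other character), written in Python rather than grep or C. The function name common_prefix is only an illustration and does not come from any tool mentioned in this thread:

```python
def common_prefix(names):
    """Return the longest beginning shared by every string in names."""
    if not names:
        return ""
    # Every string sorts between the smallest and the largest one, so the
    # prefix those two share is shared by the whole list.
    lo, hi = min(names), max(names)
    for i, (a, b) in enumerate(zip(lo, hi)):
        if a != b:
            return lo[:i]
    # No mismatch found: the shorter of the two is the common prefix.
    return lo[:min(len(lo), len(hi))]


dirs = [
    "Eric_Clapton-Big_Boss_Man-2CD-Retail-2002-DGN/",
    "Eric_Clapton-Higher_Ground-(CDS)-2003-RNS/",
    "Eric_Clapton_-_Me_and_Mr_Johnson-(PROPER)-CD-2004-TN/",
    "Eric_Clapton-One_More_Car_One_More_Rider-2CD-2002-RNS/",
    "Eric_Clapton - Pilgrim/",
]
print(common_prefix(dirs))  # prints Eric_Clapton
```

Adding a line such as "foo by Clapton, Eric" to the list makes the result empty, which is the "return nothing" case.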

An example:

aa bbbcc
bb bccdd
bb bbcdd
bb bbbdd

...should return either "" (the first line shares nothing with the
others) or, even better, "bb b" (the common beginning of the other three).
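
One possible way to get the "even better" answer is to accept a beginning that most of the lines share instead of all of them. The sketch below is an assumption about what "smart" could mean, not an established algorithm: the name majority_prefix and the min_share threshold are made up for illustration, and the approach only makes sense when min_share is above 0.5, so that at most one next character can qualify at each step.

```python
import math
from collections import Counter


def majority_prefix(lines, min_share=0.75):
    """Longest beginning shared by at least min_share of the lines.

    min_share is a hypothetical tuning knob; keep it above 0.5.
    """
    if not lines:
        return ""
    needed = max(1, math.ceil(min_share * len(lines)))
    prefix = ""
    candidates = list(lines)  # lines that still start with prefix
    while True:
        n = len(prefix)
        # Tally the next character of every line still in the running.
        counts = Counter(s[n] for s in candidates if len(s) > n)
        if not counts:
            return prefix
        ch, hits = counts.most_common(1)[0]
        if hits < needed:
            return prefix
        prefix += ch
        candidates = [s for s in candidates if len(s) > n and s[n] == ch]


lines = ["aa bbbcc", "bb bccdd", "bb bbcdd", "bb bbbdd"]
print(repr(majority_prefix(lines)))       # prints 'bb b'
print(repr(majority_prefix(lines, 1.0)))  # prints ''
```

With min_share set to 1.0 every line has to agree, which gives the strict "" answer again.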

I guess I could write a small C program to do that, but I figured maybe
it can be done with grep or something. Maybe I am wrong.

Thanks
Olle


