RE: finding similar files
> From: Eric Gerlach [mailto:egerlach@feds.uwaterloo.ca]
> Sent: Friday, February 27, 2009 11:03 AM
> Subject: Re: finding similar files
>
> On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> > There wouldn't happen to be any handy tools for searching a
> > directory tree with a few hundred ASCII files and telling me which
> > ones have similar content?
> >
> > Many have been copied, edited, merged, reformatted, split, and I'd
> > like to find the differences, decide on what to keep, and delete the
> > redundant ones.
> >
> > I know there's such a program for image files.
> >
> > I know about wdiff, which would be fine after I've paired off the
> > similar files (or fragments of files), to resolve the differences
> > that remain.
>
> You could write a script that would brute-force all possible pairs of
> files (yes, I know that's big, but it's only about 125 000 pairs for
> 500 files), run them through "wdiff -s", and then set some threshold
> for similarity on the statistics. Then, you get a list of potential
> matches.
>
> The only trick is setting the threshold... and I have no idea how to
> help you there.
>
> And if you're looking for fragments of files, that's a whole different
> ballgame.
>
> Cheers,
>
> --
> Eric Gerlach, Network Administrator
> Federation of Students
> University of Waterloo
> p: (519) 888-4567 x36329
> e: egerlach@feds.uwaterloo.ca
I would probably do something similar to what Eric suggested, but I
would weed out exact duplicates first. Try using fdupes. I tend to use:
`fdupes /your/dir/ -rS`
Add -d to delete as you go, but I strongly encourage you to read the
man page first and test it on something you don't care about, so you
know how it works.
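
If you'd rather not script the wdiff loop in shell, here's a rough
Python sketch of Eric's brute-force idea. Note the assumptions: I'm
using difflib's SequenceMatcher on word lists as a stand-in for
wdiff's statistics (the numbers won't match wdiff's output), and the
0.6 threshold is just a starting guess you'd have to tune.

```python
import difflib
import itertools
import os

def similarity(a_words, b_words):
    """Word-level similarity ratio in [0, 1] between two token lists."""
    return difflib.SequenceMatcher(None, a_words, b_words).ratio()

def find_similar(directory, threshold=0.6):
    """Brute-force every pair of files under directory and return the
    pairs whose similarity meets the threshold, most similar first."""
    files = {}
    for root, _, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            try:
                with open(path, encoding="ascii", errors="replace") as f:
                    files[path] = f.read().split()
            except OSError:
                continue  # skip unreadable files
    matches = []
    for (pa, wa), (pb, wb) in itertools.combinations(files.items(), 2):
        ratio = similarity(wa, wb)
        if ratio >= threshold:
            matches.append((ratio, pa, pb))
    return sorted(matches, reverse=True)
```

On a few hundred files the quadratic pairing is fine; it's the same
125 000-ish comparisons Eric mentioned, just in one process.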
Hope this helps!
~Stack~