RE: finding similar files
> From: Eric Gerlach [mailto:firstname.lastname@example.org]
> Sent: Friday, February 27, 2009 11:03 AM
> Subject: Re: finding similar files
> On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> > There wouldn't happen to be any handy tools for searching a
> > tree with a few hundred ASCII files and telling me which ones have
> > similar content?
> > Many have been copied, edited, merged, reformatted, split, and I'd like
> > to find the differences, decide on what to keep, and delete
> > the other ones.
> > I know there's such a program for image files.
> > I know about wdiff, which would be fine after I've paired off the
> > files (or fragments of files), to resolve the differences that remain.
> You could write a script that would brute force all possible pairs of files
> (yes, I know that's big, but it's only 125 000 for 500 files), run
> "wdiff -s", and then set some threshold for similarity on the output.
> Then, you get a list of potential matches.
> The only trick is setting the threshold... and I have no idea how to do that.
> And if you're looking for fragments of files, that's a whole different problem.
> Eric Gerlach, Network Administrator
> Federation of Students
> University of Waterloo
> p: (519) 888-4567 x36329
> e: email@example.com
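For what it's worth, Eric's brute-force pass could be sketched roughly as below. This is only an illustration under a few assumptions of mine: it uses Python's difflib on word lists as a stand-in for "wdiff -s" (it does not call wdiff at all), the function name similar_pairs is made up, and the 0.6 threshold is an arbitrary starting point that you would have to tune on your own files.

```python
import difflib
import itertools
from pathlib import Path


def similar_pairs(root, threshold=0.6):
    """Yield (ratio, path_a, path_b) for file pairs at or above threshold.

    Hypothetical helper: word-level similarity via difflib, standing in
    for the statistics that "wdiff -s" would print.
    """
    # Read every regular file once and split it into words.
    # errors="replace" tolerates stray non-text bytes.
    texts = {
        path: path.read_text(errors="replace").split()
        for path in sorted(Path(root).rglob("*"))
        if path.is_file()
    }
    # All unordered pairs: n * (n - 1) / 2, about 125 000 for 500 files.
    for a, b in itertools.combinations(texts, 2):
        matcher = difflib.SequenceMatcher(None, texts[a], texts[b], autojunk=False)
        # quick_ratio() is a cheap upper bound on ratio(), so hopeless
        # pairs are skipped before the expensive comparison.
        if matcher.quick_ratio() < threshold:
            continue
        ratio = matcher.ratio()
        if ratio >= threshold:
            yield ratio, a, b
```

Something like sorted(similar_pairs("/your/dir/"), reverse=True) would then list the most similar pairs first, which makes it easier to eyeball where a sensible threshold sits.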
I would probably do something similar to what Eric suggested, but I
would weed out exact duplicates first. Try using fdupes. I tend to use:
`fdupes /your/dir/ -rS`
Add -d to delete duplicates as you go, but I strongly encourage you to
read the man page first and test it on something you don't care about,
so you know how it works.
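If fdupes isn't installed, a rough stand-in for the exact-duplicate pass might look like the sketch below. It is not how fdupes itself is implemented; it simply groups files by a SHA-256 digest of their contents, and the function name duplicate_groups is my own invention. It only reports groups and never deletes anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_groups(root):
    """Return lists of paths under root whose contents are byte-identical.

    Hypothetical helper: groups files by SHA-256 digest. Report only,
    nothing is deleted.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests shared by two or more files.
    return [group for group in by_digest.values() if len(group) > 1]
```

Each returned group is a set of identical files, so you can keep one per group and review the rest by hand before removing them.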
Hope this helps!