
RE: finding similar files

> From: Eric Gerlach [mailto:egerlach@feds.uwaterloo.ca]
> Sent: Friday, February 27, 2009 11:03 AM
> Subject: Re: finding similar files
> On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> > There wouldn't happen to be any handy tools for searching a
> > tree with a few hundred ASCII files and telling me which ones have
> > similar content?
> >
> > Many have been copied, edited, merged, reformatted, split, and I'd
> > like to find the differences, decide on what to keep, and delete
> > the redundant ones.
> >
> > I know there's such a program for image files.
> >
> > I know about wdiff, which would be fine after I've paired off the
> > similar files (or fragments of files), to resolve differences
> > that remain.
> You could write a script that would brute-force all possible pairs
> of files (yes, I know that's big, but it's only about 125,000 pairs
> for 500 files), run each pair through "wdiff -s", and then set some
> threshold for similarity on the output. Then you get a list of
> potential matches.
>
> The only trick is setting the threshold... and I have no idea how to
> help you there.
>
> And if you're looking for fragments of files, that's a whole
> different ballgame.
> Cheers,
> --
> Eric Gerlach, Network Administrator
> Federation of Students
> University of Waterloo
> p: (519) 888-4567 x36329
> e: egerlach@feds.uwaterloo.ca
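
Eric's brute-force idea can be sketched in Python. This is only an illustration: difflib's ratio() stands in for parsing the statistics that "wdiff -s" would print, and the 0.7 threshold is an arbitrary starting point you would have to tune by hand, exactly as Eric warns.

```python
#!/usr/bin/env python3
"""Brute-force pairwise similarity over a tree of text files.

A sketch of the approach above: difflib's SequenceMatcher.ratio()
stands in for the word-level statistics of `wdiff -s`, and the
threshold is a guess to be tuned on your own data.
"""
import difflib
import itertools
import pathlib

THRESHOLD = 0.7  # arbitrary starting point; tune by inspecting results

def similar_pairs(root, threshold=THRESHOLD):
    """Return (score, file_a, file_b) for every pair above the threshold."""
    files = [p for p in pathlib.Path(root).rglob("*") if p.is_file()]
    # Compare word lists, roughly what wdiff does, rather than raw bytes.
    words = {p: p.read_text(errors="replace").split() for p in files}
    pairs = []
    for a, b in itertools.combinations(files, 2):
        # ratio() runs 0.0 (nothing shared) to 1.0 (identical word lists)
        score = difflib.SequenceMatcher(None, words[a], words[b]).ratio()
        if score >= threshold:
            pairs.append((score, a, b))
    return sorted(pairs, reverse=True)  # most similar pairs first
```

For a few hundred files the quadratic loop is fine; past a few thousand it would need smarter pre-filtering.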

I would probably do something similar to what Eric suggested, but I
would weed out exact duplicates first. Try fdupes; I tend to use:
`fdupes -rS /your/dir/`
Add -d to delete duplicates as you go, but I strongly encourage you to
read the man page first and to test it on something you don't care
about so you know how it works.
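
fdupes itself is the right tool for this; purely to illustrate what the weeding step does, here is a minimal Python stand-in that groups byte-identical files by content hash, so only one copy per group needs to go through the pairwise comparison afterwards.

```python
#!/usr/bin/env python3
"""Weed out byte-identical duplicates before pairwise comparison.

A minimal stand-in for `fdupes -rS`: group files under a directory
by the SHA-256 of their contents and report each group of exact
duplicates, so only one file per group needs comparing later.
"""
import hashlib
import pathlib
from collections import defaultdict

def duplicate_groups(root):
    """Return lists of paths whose file contents are byte-identical."""
    by_hash = defaultdict(list)
    for p in pathlib.Path(root).rglob("*"):
        if p.is_file():
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            by_hash[digest].append(p)
    # Keep only hashes shared by more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Unlike fdupes, this sketch has no size pre-check or delete mode; it only shows the grouping idea.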

Hope this helps!

