
RE: finding similar files



> From: Eric Gerlach [mailto:egerlach@feds.uwaterloo.ca]
> Sent: Friday, February 27, 2009 11:03 AM
> Subject: Re: finding similar files
> 
> On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> > There wouldn't happen to be any handy tools for searching a
> > directory tree with a few hundred ASCII files and telling me which
> > ones have similar content?
> >
> > Many have been copied, edited, merged, reformatted, split, and I'd
> > like to find the differences, decide on what to keep, and delete
> > redundant ones.
> >
> > I know there's such a program for image files.
> >
> > I know about wdiff, which would be fine after I've paired off the
> > similar files (or fragments of files) to resolve differences that
> > remain.
> 
> You could write a script that would brute force all possible pairs of
> files (yes, I know that's big, but it's only 125 000 for 500 files),
> run them through "wdiff -s", and then set some threshold for
> similarity on the statistics. Then, you get a list of potential
> matches.
> 
> The only trick is setting the threshold... and I have no idea how to
> help you there.
> 
> And if you're looking for fragments of files, that's a whole different
> ballgame.
> 
> Cheers,
> 
> --
> Eric Gerlach, Network Administrator
> Federation of Students
> University of Waterloo
> p: (519) 888-4567 x36329
> e: egerlach@feds.uwaterloo.ca
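
Something like this rough sketch would do the pairing Eric describes.
It is untested; GNU wdiff, whitespace-free filenames, and the 80%
threshold are all assumptions you'd want to adjust for your data:

#!/bin/bash
# Brute-force every pair of files and report the similar ones.
dir=${1:-.}
threshold=80        # percent of common words; pure guess, tune it

# Stable file list; assumes no whitespace in the filenames.
files=( $(find "$dir" -type f | sort) )

for (( i = 0; i < ${#files[@]}; i++ )); do
    for (( j = i + 1; j < ${#files[@]}; j++ )); do
        # -1 -2 -3 suppress the word-by-word output, so -s leaves only
        # the statistics lines, e.g. "foo: 1000 words  850 85% common ..."
        pct=$(wdiff -s -1 -2 -3 "${files[i]}" "${files[j]}" \
              | sed -n 's/.*[^0-9]\([0-9][0-9]*\)% common.*/\1/p' \
              | head -n 1)
        if [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]; then
            echo "${pct}%  ${files[i]}  ${files[j]}"
        fi
    done
done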

I would probably do something similar to what Eric mentioned, but I
would weed out the exact duplicates first. Try using fdupes. I tend to
use:
`fdupes /your/dir/ -rS`
Add -d to delete as you go, but I highly encourage you to read the man
page first and to test it on something you don't care about, so you
know how it works.
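
For example, a cautious sequence (the path is a placeholder):

# Read-only pass first: list every duplicate set with its size.
fdupes -rS /your/dir/

# Only then the destructive pass: -d prompts, per duplicate set,
# for which copy to keep. Try it on scratch data first.
fdupes -rd /your/dir/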

Hope this helps!

~Stack~

