
Re: finding similar files



On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> There wouldn't happen to be any handy tools for searching a directory 
> tree with a few hundred ASCII files and telling me which ones have 
> similar content?
> 
> Many have been copied, edited, merged, reformatted, split, and I'd like
> to find the differences, decide on what to keep, and delete redundant
> ones.
> 
> I know there's such a program for image files.
> 
> I know about wdiff, which would be fine after I've paired off the similar 
> files (or fragments of files), to resolve differences that remain.

You could write a script that brute-forces all possible pairs of files
(yes, I know that's a lot, but it's only about 125,000 pairs for 500 files),
runs each pair through "wdiff -s", and applies a similarity threshold to the
resulting statistics. That gives you a list of potential matches.

The only trick is setting the threshold... and I have no idea how to help you
there.
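
To make the pairwise idea concrete, here is a rough sketch in Python. Note
that it swaps in the standard library's difflib.SequenceMatcher for
"wdiff -s", so the script does not depend on parsing wdiff's output. The
function names (read_words, similar_pairs) and the default threshold of 0.5
are my own placeholders, not anything tested on real data.

```python
import itertools
import os
from difflib import SequenceMatcher


def read_words(path):
    """Return the file's contents as a list of whitespace-separated words."""
    with open(path, encoding="ascii", errors="replace") as f:
        return f.read().split()


def similar_pairs(root, threshold=0.5):
    """Compare every pair of files under root, word by word.

    Returns (ratio, path_a, path_b) tuples for pairs whose similarity
    ratio is at least threshold, most similar first. The threshold
    default is an arbitrary assumption and needs tuning.
    """
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    )
    # Read each file once instead of once per pair.
    words = {p: read_words(p) for p in paths}

    matches = []
    for a, b in itertools.combinations(paths, 2):
        matcher = SequenceMatcher(None, words[a], words[b], autojunk=False)
        ratio = matcher.ratio()  # 0.0 = nothing shared, 1.0 = identical
        if ratio >= threshold:
            matches.append((ratio, a, b))

    matches.sort(reverse=True)
    return matches
```

Something like similar_pairs("some/dir", threshold=0.8) would then give the
candidate list to feed to wdiff by hand. Comparing whole files this way is
quadratic in the number of files and can be slow on long files, so treat it
as a starting point only.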

And if you're looking for fragments of files, that's a whole different
ballgame.

Cheers,

-- 
Eric Gerlach, Network Administrator
Federation of Students
University of Waterloo
p: (519) 888-4567 x36329
e: egerlach@feds.uwaterloo.ca

