Re: finding similar files
On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> There wouldn't happen to be any handy tools for searching a directory
> tree with a few hundred ASCII files and telling me which ones have
> similar content?
>
> Many have been copied, edited, merged, reformatted, split, and I'd like
> to find the differences, decide on what to keep, and delete redundant
> ones.
>
> I know there's such a program for image files.
>
> I know about wdiff, which would be fine after I've paired off the similar
> files (or fragments of files), to resolve differences that remain.
You could write a script that brute-forces all possible pairs of files (yes, I
know that's a lot, but it's only about 125,000 pairs for 500 files), runs each
pair through "wdiff -s", and applies some similarity threshold to the
statistics it prints. That gives you a list of potential matches.
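
For what it's worth, here is a rough sketch of that brute-force loop in Python.
To keep it self-contained it does not call wdiff at all: it swaps in the
standard library's difflib.SequenceMatcher on whitespace-split words as a
stand-in for the "wdiff -s" statistics. The function names (read_words,
word_similarity, similar_pairs) and the default threshold of 0.5 are made up
for illustration, not recommendations.

```python
import difflib
import itertools
import os
import sys


def read_words(path):
    """Return the file's contents as a list of whitespace-separated words."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return handle.read().split()


def word_similarity(words_a, words_b):
    """Score two word lists from 0.0 (nothing shared) to 1.0 (identical)."""
    # difflib stands in for "wdiff -s" here; autojunk is disabled so that
    # very common words are not silently ignored in longer files.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


def similar_pairs(root, threshold=0.5):
    """Compare every pair of files under root; keep pairs scoring >= threshold."""
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    )
    # Read each file once rather than once per pair.
    words = {path: read_words(path) for path in paths}
    matches = []
    # combinations() yields n*(n-1)/2 pairs: 124,750 for 500 files.
    for a, b in itertools.combinations(paths, 2):
        score = word_similarity(words[a], words[b])
        if score >= threshold:
            matches.append((score, a, b))
    # Most similar pairs first.
    return sorted(matches, reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python similar.py /path/to/tree
    for score, a, b in similar_pairs(sys.argv[1]):
        print(f"{score:.2f}  {a}  {b}")
```

Run against a directory, it prints one line per candidate pair, most similar
first. Each comparison can be slow on large files, so expect the full run over
a few hundred files to take a while.
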
The only trick is setting the threshold... and I have no idea how to help you
there.
And if you're looking for fragments of files, that's a whole different
ballgame.
Cheers,
--
Eric Gerlach, Network Administrator
Federation of Students
University of Waterloo
p: (519) 888-4567 x36329
e: egerlach@feds.uwaterloo.ca