
RE: finding similar files



> From: Eric Gerlach [mailto:egerlach@feds.uwaterloo.ca]
> Sent: Friday, February 27, 2009 11:03 AM
> Subject: Re: finding similar files
> 
> On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> > There wouldn't happen to be any handy tools for searching a
> > directory tree with a few hundred ASCII files and telling me which
> > ones have similar content?
> >
> > Many have been copied, edited, merged, reformatted, split, and I'd
> > like to find the differences, decide on what to keep, and delete
> > redundant ones.
> >
> > I know there's such a program for image files.
> >
> > I know about wdiff, which would be fine after I've paired off the
> > similar files (or fragments of files) to resolve differences that
> > remain.
> 
> You could write a script that would brute force all possible pairs of
> files (yes, I know that's big, but it's only 125 000 for 500 files),
> run them through "wdiff -s", and then set some threshold for
> similarity on the statistics. Then, you get a list of potential
> matches.
> 
> The only trick is setting the threshold... and I have no idea how to
> help you there.
> 
> And if you're looking for fragments of files, that's a whole different
> ballgame.
> 
> Cheers,
> 
> --
> Eric Gerlach, Network Administrator
> Federation of Students
> University of Waterloo
> p: (519) 888-4567 x36329
> e: egerlach@feds.uwaterloo.ca
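
Something like this rough sketch would do the pairing Eric describes.
It is untested; GNU wdiff, whitespace-free filenames, and the 80%
threshold are all assumptions you'd want to adjust for your data:

#!/bin/bash
# Brute-force every pair of files and report the similar ones.
dir=${1:-.}
threshold=80        # percent of common words; pure guess, tune it

# Stable file list; assumes no whitespace in the filenames.
files=( $(find "$dir" -type f | sort) )

for (( i = 0; i < ${#files[@]}; i++ )); do
    for (( j = i + 1; j < ${#files[@]}; j++ )); do
        # -1 -2 -3 suppress the word-by-word output, so -s leaves only
        # the statistics lines, e.g. "foo: 1000 words  850 85% common ..."
        pct=$(wdiff -s -1 -2 -3 "${files[i]}" "${files[j]}" \
              | sed -n 's/.*[^0-9]\([0-9][0-9]*\)% common.*/\1/p' \
              | head -n 1)
        if [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]; then
            echo "${pct}%  ${files[i]}  ${files[j]}"
        fi
    done
done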

I would probably do something similar to what Eric mentioned, but I
would weed out the exact duplicates first. Try using fdupes. I tend to
use:
`fdupes /your/dir/ -rS`
Add -d to delete as you go, but I highly encourage you to read the man
page first and to test it on something you don't care about, so you
know how it works.
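
For example, a cautious sequence (the path is a placeholder):

# Read-only pass first: list every duplicate set with its size.
fdupes -rS /your/dir/

# Only then the destructive pass: -d prompts, per duplicate set,
# for which copy to keep. Try it on scratch data first.
fdupes -rd /your/dir/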

Hope this helps!

~Stack~

