
Re: Bug#656142: ITP: duff -- Duplicate file finder



> On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> > * Package name    : duff
> > * URL             : http://duff.sourceforge.net/

On Tue, 2012-01-17 at 09:56 +0100, Simon Josefsson wrote:
> If there aren't warnings about use of SHA1 in the tool, there should
> be. While I don't recall any published SHA1 collisions, SHA1 is
> considered broken and shouldn't be used if you want to trust your
> comparisons.  I'm assuming the tool supports SHA256 and other SHA2
> hashes as well?  It might be useful to make sure the defaults are
> non-SHA1.

Duff supports SHA1, SHA256, SHA384 and SHA512 hashes.  The default is
SHA1.  For comparison, rdfind supports only MD5 and SHA1 hashes.  Thanks
for the note, Simon -- I'll bring it to the attention of the upstream
author, Camilla Berglund.
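
For anyone curious what switching the digest amounts to, here is a rough
Python sketch (my own illustration, not duff's actual code, which is C)
of computing a file digest the way a duplicate finder typically would:

    import hashlib

    def file_digest(path, algorithm="sha256", chunk_size=65536):
        """Hash a file in fixed-size chunks so large files never
        have to fit in memory all at once."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

Swapping "sha256" for "sha1" (or "sha512") is the only change needed in
a sketch like this, which is why making the default digest configurable
upstream should be cheap.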

On Tue, 2012-01-17 at 09:12 +0000, Lars Wirzenius wrote:
> rdfind seems to be quickest one, but duff compares well with hardlink,
> which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
> Debian so far.
> 
> This was done using my benchmark-cmd utility in my extrautils
> collection (not in Debian): http://liw.fi/extrautils/ for source.

Thanks for the pointer to your benchmark-cmd tool, Lars.  Very handy!
My results with it mirrored yours -- of the similar tools, duff appears
to lag only rdfind in performance (for my particular dataset, at least).

I looked into duff's methods a bit and discovered a few easy performance
optimizations that may speed it up a bit more.  The author is reviewing
my proposed patch now, and seems very open to collaboration.

> Personally, I would be wary of using checksums for file comparisons,
> since comparing files byte-by-byte isn't slow (you only need to
> do it to files that are identical in size, and you need to read
> all the files anyway).

Byte-by-byte might well be slower than checksums if you end up faced
with N>2 very large (uncacheable) files of identical size but unique
contents.  They all need to be checked against each other, so each of
the N files would need to be read up to N-1 times.  Anyway, duff
actually *does* offer byte-by-byte comparison as an option (rdfind does
not).
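
To make the trade-off concrete, here is a rough Python sketch (names and
structure are my own illustration, not how duff or rdfind are written)
of the two strategies applied to a group of same-sized files:

    import hashlib
    from itertools import combinations

    def groups_by_digest(paths, algorithm="sha256", chunk_size=65536):
        """Checksum strategy: each file is read exactly once, then
        files sharing a digest are reported as duplicates."""
        groups = {}
        for path in paths:
            h = hashlib.new(algorithm)
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            groups.setdefault(h.hexdigest(), []).append(path)
        return [g for g in groups.values() if len(g) > 1]

    def equal_pairs_bytewise(paths, chunk_size=65536):
        """Byte-by-byte strategy: every pair of same-sized files is
        compared directly, so with N mutually unique files each one
        may be read up to N-1 times."""
        pairs = []
        for a, b in combinations(paths, 2):
            with open(a, "rb") as fa, open(b, "rb") as fb:
                while True:
                    ca, cb = fa.read(chunk_size), fb.read(chunk_size)
                    if ca != cb:
                        break              # first difference: not duplicates
                    if not ca:
                        pairs.append((a, b))  # both exhausted: identical
                        break
        return pairs

With a few small files the pairwise version can win because it stops at
the first differing byte, but for many large, mostly-unique files of the
same size the repeated reads add up, which is the case I had in mind
above.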

> I also think we've now got enough of duplicate file finders in
> Debian that it's time to consider whether we need so many. It's
> too bad they all have incompatible command line syntaxes, or it
> would be possible to drop some. (We should accept a new one if
> it is better than the existing ones, of course. Evidence required.)

To me, the premise that a new package must be better than existing
similar ones ("evidence required", no less) seems pretty questionable.
It may not be so easy to establish just what "better than" means, and it
puts us in a position of making value judgments for our users that they
should be able to make for themselves.

While I do think it is productive to compare the performance of these
similar tools, I don't see much value in pitting them against each other
in benchmark wars as a criterion for acceptance into Debian.

Here we have a good quality DFSG-compliant package with an active
upstream and a willing DD maintainer.  While similar tools do exist
already in Debian, they do not offer identical feature sets or user
interfaces, and only one of them has been shown to outperform duff in
quick spot checks.  Some users have expressed a preference for duff over
the others.  Does that make it "better than the existing ones"?  My
answer: Who cares? Nobody is making us choose only one.

In my view, it's not really a problem to carry multiple duplicate file
detectors in Debian; we will best serve our users by letting them choose
their preferred tool for the job.  And by allowing such packages into
Debian we encourage their improvement, to everyone's benefit.

 -Kamal
