Re: Bug#656142: ITP: duff -- Duplicate file finder

On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> * Package name    : duff
> * URL             : http://duff.sourceforge.net/

A quick speed comparison:

user  system  real  max RSS  elapsed  cmd
 (s)     (s)   (s)    (KiB)      (s)
 3.2     2.4   5.8    62784      5.8  hardlink --dry-run files > /dev/null
 1.1     0.4   1.6    15424      1.6  rdfind files > /dev/null
 1.9     0.2   2.2     9904      2.2  duff-0.5/src/duff -r files > /dev/null

rdfind seems to be the quickest one, but duff compares well with
hardlink, which (see http://liw.fi/dupfiles/) was the fastest one I
knew of in Debian so far.

This was done using the benchmark-cmd utility from my extrautils
collection (not in Debian; source at http://liw.fi/extrautils/).
The exact command to generate the above table:

benchmark-cmd \
    --setup='genbackupdata --create=100m files' \
    --setup='cp -a files/0 files/copy' \
    --cleanup='rm -rf files' \
    --verbose \
    --command='hardlink --dry-run files > /dev/null' \
    --command='rdfind files > /dev/null' \
    --command='duff-0.5/src/duff -r files > /dev/null'
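For readers without benchmark-cmd, here is a rough Python sketch of how
figures like these (user, system, and wall-clock time, plus max RSS)
can be collected for one command. It is not how benchmark-cmd itself
works; the function name measure is made up for illustration, and the
sketch assumes a Unix system where the resource module is available.

```python
import resource
import subprocess
import time


def measure(command):
    """Run a shell command and return its CPU times, wall time and max RSS.

    Hypothetical helper, not part of benchmark-cmd.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(command, shell=True, check=False)
    real = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        # CPU time spent in user mode and in the kernel by the child.
        "user": after.ru_utime - before.ru_utime,
        "system": after.ru_stime - before.ru_stime,
        # Wall-clock time for the whole run.
        "real": real,
        # ru_maxrss is the largest RSS of any child waited for so far,
        # so it is only meaningful for the first or largest command.
        # Linux reports it in KiB; other systems may use other units.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    print(measure("ls > /dev/null"))
```

For a single-threaded command, user plus system time should come out at
or below the wall-clock time.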

Personally, I would be wary of using checksums for file comparisons.
Comparing files byte-by-byte isn't slow: you only need to do it for
files that are identical in size, and computing checksums would
require reading all of those files anyway.
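To illustrate the byte-by-byte approach, here is a minimal Python
sketch: group files by size, then compare contents directly within
each group. It is not how duff, rdfind, or hardlink are implemented;
the names same_content and find_duplicates are made up for this
example, and symlinks are skipped as a simplifying assumption.

```python
import os
from collections import defaultdict

CHUNK = 64 * 1024  # read size per comparison step (arbitrary choice)


def same_content(path_a, path_b):
    """Return True if the two files contain identical bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                return False  # stop at the first differing chunk
            if not a:
                return True   # both files ended without a difference


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        clusters = []
        for path in sorted(paths):
            for cluster in clusters:
                if same_content(cluster[0], path):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups


if __name__ == "__main__":
    for group in find_duplicates("files"):
        print(" ".join(group))
```

Files with a unique size are never opened, and a mismatch ends the
comparison at the first differing chunk. The pairwise loop can get
slow when many different files share one size; a real tool would want
something smarter there.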

I also think we've now got enough duplicate file finders in Debian
that it's time to consider whether we need so many. It's too bad
they all have incompatible command line syntaxes; otherwise it
would be possible to drop some. (We should accept a new one if
it is better than the existing ones, of course. Evidence required.)

Freedom-based blog/wiki/web hosting: http://www.branchable.com/
