Re: Bug#656142: ITP: duff -- Duplicate file finder
Samuel Thibault, on Tue 17 Jan 2012 12:15:16 +0100, wrote:
> Lars Wirzenius, on Tue 17 Jan 2012 10:45:20 +0000, wrote:
> > On Tue, Jan 17, 2012 at 10:30:20AM +0100, Samuel Thibault wrote:
> > > Lars Wirzenius, on Tue 17 Jan 2012 09:12:58 +0000, wrote:
> > > > real  user  system  max RSS  elapsed  cmd
> > > >  (s)   (s)     (s)    (KiB)      (s)
> > > >  3.2   2.4     5.8    62784      5.8  hardlink --dry-run files > /dev/null
> > > >  1.1   0.4     1.6    15424      1.6  rdfind files > /dev/null
> > > >  1.9   0.2     2.2     9904      2.2  duff-0.5/src/duff -r files > /dev/null
> > >
> > > And fdupes on the same set of files?
> >
> > real  user  system  max RSS  elapsed  cmd
> >  (s)   (s)     (s)    (KiB)      (s)
> >  3.1   2.4     5.5    62784      5.5  hardlink --dry-run files > /dev/null
> >  1.1   0.4     1.6    15392      1.6  rdfind files > /dev/null
> >  1.3   0.9     2.2    13936      2.2  fdupes -r -q files > /dev/null
> >  1.9   0.2     2.1     9904      2.1  duff-0.5/src/duff -r files > /dev/null
> >
> > Someone should run the benchmark on a large set of data, preferably
> > on various kinds of real data, rather than my small synthetic data set.
>
> On my PhD work directory, which contains various stuff (500 MiB, 18000
> files, both big and small files (svn/git checkouts etc.)), with
> everything already in cache (no disk I/O):
>
> hardlink -t --dry-run . > /dev/null        1,06s user 0,46s system 99% cpu 1,538 total
> rdfind . > /dev/null                       0,68s user 0,19s system 99% cpu 0,877 total
> fdupes -q -r . > /dev/null 2> /dev/null    0,80s user 0,90s system 99% cpu 1,708 total
> ~/src/duff-0.5/src/duff -r . > /dev/null   1,53s user 0,08s system 99% cpu 1,610 total
And with nothing in cache, on an SSD drive:
hardlink -t --dry-run . > /dev/null        1,86s user 1,23s system 12% cpu 24,260 total
rdfind . > /dev/null                       1,18s user 1,31s system  8% cpu 27,837 total
fdupes -q -r . > /dev/null 2> /dev/null    1,30s user 2,13s system 11% cpu 29,820 total
~/src/duff-0.5/src/duff -r . > /dev/null   1,88s user 0,47s system 16% cpu 13,949 total
(Yes, the user times differ between the warm and cold runs, and the
measurements are stable. Also note that I added -t to hardlink, since
otherwise it takes file timestamps into account.)
I guess duff gets a clear win because it does not systematically compute
checksums of same-sized files: for big files it first reads and compares
just a few bytes.
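To make that strategy concrete, here is a minimal sketch of the general
size -> prefix -> full-checksum approach (this is my own illustration, not
duff's actual implementation; the function name and prefix length are
made up for the example):

```python
# Sketch of a duplicate finder that groups by size, then by a short
# prefix, and only hashes full contents for remaining collisions.
# Illustrative only; not duff's actual code.
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths, prefix_len=4096):
    # 1. Group by file size: files of different sizes cannot be equal.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for group in by_size.values():
        if len(group) < 2:
            continue
        # 2. Within a size group, compare only the first few bytes.
        #    For big files this cheaply rules out most non-duplicates
        #    without reading (or checksumming) the whole file.
        by_prefix = defaultdict(list)
        for p in group:
            with open(p, "rb") as f:
                by_prefix[f.read(prefix_len)].append(p)
        # 3. Only files that still collide get a full checksum.
        for candidates in by_prefix.values():
            if len(candidates) < 2:
                continue
            by_hash = defaultdict(list)
            for p in candidates:
                h = hashlib.sha1()
                with open(p, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        h.update(chunk)
                by_hash[h.hexdigest()].append(p)
            duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

The saving shows up exactly in the cold-cache case above: same-sized but
distinct big files cost one short read each instead of a full pass.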
samuel