Re: Bug#509685: ITP: hardlink -- Hardlink multiple copies of the same file
On Fri, 26 Dec 2008, John Goerzen wrote:
> Julian Andres Klode wrote:
> > Hardlink is a tool which detects multiple copies of the same file and
> > replaces them with hardlinks.
> > .
> > The idea has been taken from http://code.google.com/p/hardlinkpy/, but
> > the code has been written from scratch and licensed under the MIT
> > license.
> Do we really need another tool like this?
> We already have these packages:
I think that's a little harsh. There are lots of apps in Debian that
provide similar functionality to other apps in Debian. I do agree that if
hardlink only duplicates functionality already available in finddup, there
is no point in maintaining both.
I wasn't aware of finddup before this thread. Since faster-dupemerge
breaks with the upgrade to lenny, I thought I'd give finddup a try. However,
finddup is too limited for my use case.
My home server stores dirvish (think rsync --link-dest) backups for 2 other
machines that dual-boot Windows and Linux. Each partition is backed up
separately, with the Windows partitions having separate backups made from
both Windows and Linux. In addition, the Linux partitions sometimes contain
chroots, and the Windows partitions have games installed, then copied to a
different directory and modded. That means there are a lot of duplicate
files that rsync --link-dest doesn't hardlink. Hardlinking files afterwards
enables me to fit over 1 TB of used disk space x 60 days onto a single 1 TB
drive.
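The post-hoc pass amounts to something like the following (a minimal
Python sketch of the idea, not how hardlink itself is implemented; a real
tool would also compare permissions and ownership before linking):

```python
import hashlib
import os


def file_digest(path, chunk=1 << 20):
    """SHA-256 of a file, read in chunks so large files don't fill RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()


def link_duplicates(root):
    """Replace identical regular files under root with hardlinks to the
    first copy seen. Returns the number of links created."""
    seen = {}  # digest -> canonical path
    made = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if not os.path.isfile(p) or os.path.islink(p):
                continue
            d = file_digest(p)
            if d in seen and not os.path.samefile(seen[d], p):
                # Same content, different inode: replace with a hardlink.
                os.unlink(p)
                os.link(seen[d], p)
                made += 1
            else:
                seen.setdefault(d, p)
    return made
```

After a run, duplicate copies across the per-day backup trees share one
inode, which is where the space savings come from.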
Finddup assumes that the file list will fit in memory. This is a
showstopper for me. Attempting to run finddup on my home server over a
partial backup set of a single day (1,898,219 files) resulted in
unacceptable memory usage (739 MB after 4 hours on a machine with 512 MB
of physical RAM). That pushed swap usage over 600 MB and made logging in
over ssh take 30 seconds.
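A more memory-conscious design typically groups files by size first, since
only same-size files can be duplicates; singleton sizes are dropped before
any content is read or hashed. A rough Python sketch of that first stage
(my own illustration, not finddup's actual algorithm):

```python
import os
from collections import defaultdict


def candidate_groups(root):
    """Group regular files under root by size and return only the groups
    with more than one member. Files whose size is unique are discarded
    here, so the expensive content-comparison stage never touches them."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                by_size[os.stat(p).st_size].append(p)
    return [paths for paths in by_size.values() if len(paths) > 1]
```

On a backup set like mine, most sizes are unique, so this prunes the bulk
of the ~1.9 million files before any hashing happens.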
Finddup lacks an option to require matching timestamps before hardlinking.
Linking files with different timestamps discards information that can be
useful in a backup, and it results in rsync thinking that the files have
changed and retransmitting them anyway.
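The check I want is cheap: only treat two files as linkable when their
sizes and mtimes both match, since rsync's quick check compares exactly
those two fields. A minimal sketch (hypothetical helper, not finddup or
hardlink code):

```python
import os


def linkable(a, b):
    """True only when both size and mtime match, so hardlinking the pair
    preserves the metadata that rsync's quick check (size + mtime)
    relies on."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)
```

Files that fail this test are left alone even if their contents are
byte-identical.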
Finddup's syntax for specifying directories to link is clumsy when what I
really want to link is /srv/dirvish/*/2009.01.1*/tree.
In addition, faster-dupemerge's ability to pass options to find means that
I can do a quick partial run by limiting find to files larger than 1 MB,
something that is often enough to recover 10+ GB after installing a couple
of games.
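The size cutoff itself is trivial to express (the equivalent of find's
-size +1M; hypothetical helper for illustration):

```python
import os


def large_files(root, min_bytes=1 << 20):
    """Yield regular files under root larger than min_bytes, mirroring a
    quick partial run restricted to files over 1 MB."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and os.stat(p).st_size > min_bytes:
                yield p
```

Feeding only this subset into the dedup pass keeps a partial run fast
while still catching the big wins.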