Re: Bug#509685: ITP: hardlink -- Hardlink multiple copies of the same file

To: debian-devel@lists.debian.org
Subject: Re: Bug#509685: ITP: hardlink -- Hardlink multiple copies of the same file
From: Andrew Vaughan <ajv-lists@netspace.net.au>
Date: Fri, 16 Jan 2009 01:10:22 +1100
Message-id: <200901160110.23011.ajv-lists@netspace.net.au>
In-reply-to: <4953CB3B.7080403@complete.org>
References: <20081224192523.GA29189@jak-linux.org> <4953CB3B.7080403@complete.org>

On Fri, 26 Dec 2008, John Goerzen wrote:
> Julian Andres Klode wrote:
> >  Hardlink is a tool which detects multiple copies of the same file and
> > replaces them with hardlinks.
> >  .
> >  The idea has been taken from http://code.google.com/p/hardlinkpy/, but
> > the code has been written from scratch and licensed under the MIT
> > license.
>
> Do we really need another tool like this?
>
> We already have these packages:
>
>   fdupes
>   perforate
>
Hi John

I think that's a little harsh.  There are lots of apps in Debian that 
provide similar functionality to other apps in Debian.  I do agree that iff 
hardlink is only duplicating functionality available in finddup, then there 
is no point in maintaining both.  

I wasn't aware of finddup before this thread.  Since faster-dupemerge [0]
breaks with the upgrade to lenny, I thought I'd give finddup a try.
However, finddup is too limited for my use case.

My home server stores dirvish (think rsync --link-dest) backups for 2 other
machines that dual boot Windows and Linux.  Each partition is backed up
separately, with the Windows partitions having separate backups from both
Windows and Linux.  In addition, the Linux partitions sometimes contain
chroots, and the Windows partitions have games installed, then copied to a
different dir and modded.  That means there are a lot of duplicate files
that rsync --link-dest doesn't hardlink.  Hardlinking files afterwards
enables me to get over 1 TB of used disk space x 60 days onto a single 1 TB
disk.

Finddup assumes that the file list will fit in memory.  This is a
showstopper for me.  Attempting to run finddup on my home server over a
partial backup set of a single day (1,898,219 files) resulted in
unacceptable memory usage: 739 MB after 4 hours on a machine with 512 MB of
physical RAM.  That meant swap usage of over 600 MB and a 30 sec ssh login
time.

Finddup lacks an option to require matching timestamps before hardlinking.
Linking files whose timestamps differ discards info that can be useful in a
backup, and results in rsync thinking that the files have changed and
retransmitting them anyway.
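
To illustrate the kind of timestamp check I mean, here is a minimal Python
sketch.  It is not code from finddup, hardlink or faster-dupemerge; the
function names and the temporary-file suffix are made up for the example,
and it compares mtimes at whole-second resolution only.

```python
import filecmp
import os
import stat


def safe_to_link(a, b, require_same_mtime=True):
    """Return True if b could be replaced by a hardlink to a."""
    sa, sb = os.lstat(a), os.lstat(b)
    # Only consider regular files (lstat does not follow symlinks).
    if not (stat.S_ISREG(sa.st_mode) and stat.S_ISREG(sb.st_mode)):
        return False
    # Hardlinks cannot cross filesystems.
    if sa.st_dev != sb.st_dev:
        return False
    # Already the same inode: nothing to gain.
    if sa.st_ino == sb.st_ino:
        return False
    if sa.st_size != sb.st_size:
        return False
    # Optional check: keep files apart if their timestamps differ.
    if require_same_mtime and int(sa.st_mtime) != int(sb.st_mtime):
        return False
    # Byte-by-byte comparison of the contents.
    return filecmp.cmp(a, b, shallow=False)


def replace_with_link(src, dst):
    """Replace dst with a hardlink to src via a temporary name."""
    tmp = dst + ".hardlink-tmp"  # hypothetical suffix for this sketch
    os.link(src, tmp)
    os.replace(tmp, dst)  # rename over dst, so dst never goes missing
```

With require_same_mtime=True, two identical files with different mtimes
are left alone, so the backup keeps both timestamps.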

Finddup's syntax for specifying directories to link is clumsy when what I 
really want to link is /srv/dirvish/*/2009.01.1*/tree.  

In addition, faster-dupemerge's ability to pass options to find means that
I can do a quick partial run by limiting find to files larger than 1 MB,
something that is often enough to recover 10+ GB after installing a couple
of games.
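
As a rough sketch of that selection step (a shell-style glob over the backup
trees plus a size filter), something like the following Python would do.
The function name and the 1 MiB default are my own choices for the example,
not anything taken from faster-dupemerge.

```python
import glob
import os
import stat


def large_files(pattern, min_size=1024 * 1024):
    """Yield regular files bigger than min_size bytes under every
    directory matching the glob pattern."""
    for root in sorted(glob.glob(pattern)):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)  # do not follow symlinks
                if stat.S_ISREG(st.st_mode) and st.st_size > min_size:
                    yield path
```

For example, large_files("/srv/dirvish/*/2009.01.1*/tree") walks only the
trees I care about.  Because it is a generator, paths are produced one at a
time rather than collected into one big list first.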

Cheers 
Andrew V.

[0] http://www.furryterror.org/~zblaxell/projects/dupemerge/dupemerge.html
