
Bug#509685: ITP: hardlink -- Hardlink multiple copies of the same file



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

I am the "chf" who requested the program on IRC in the first place.
Noticing the serious doubts expressed here about whether the package is
really needed in Debian, I want to point out why I need it nonetheless
and, more importantly, why I think it would enrich Debian as well.

I am not using the hardlink utility in spite of "rsync --link-dest"
already existing; I need it *because* I use this rsync feature for
backups.

More precisely, I am using "rsnapshot", which essentially keeps
several generations of a backup tree and saves huge amounts of
space by hard-linking unchanged files to the previous generation
with "rsync --link-dest".

However, I ran into a bug in rsnapshot that caused a "gap" whenever a
host was down at the time a backup was scheduled. Afterwards the
whole trees were unconnected again. The issue does not seem to be easily
reproducible, so it remains unresolved despite my report on the
rsnapshot mailing list.

In my search for a remedy I came across several people who had similar
problems repairing "broken apart" rsnapshot mirrors, and
they eventually directed me towards "hardlink.py", which simply looks
for equal files in the set of given directories and links any equal
files together.

This saves even more space than "rsnapshot" on its own, because *all*
equal files are linked, not only those with the same path and filename
from one backup generation to the immediate successor.

At first glance, it seems that this is all the functionality needed,
so you could use the perforate package (where the feature is so well
hidden that I was not able to find it) or enhance "fdupes" as
requested in #284274.

However, I have many subtrees with a very large total number of files,
and running one of the existing utilities over the whole file set
would take several weeks because the machine starts swapping.

So I need to split this into several runs, each over a subset of the
trees. These runs are handled independently, so none of them
"knows" anything about the others. This only works if each run
"maximizes" the link count, that is, replaces the file that is linked
to fewer other files with a hard link to the one that already has the
higher link count.
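
A minimal Python sketch of the "maximize link count" rule described
above. It is not taken from hardlink.py or its patch; the function name
link_maximizing and the temporary-file suffix are hypothetical, and it
assumes the caller has already verified that both files have equal
contents and live on the same filesystem.

```python
import os


def link_maximizing(path_a, path_b):
    """Hard-link two equal files, keeping the one with the higher link count.

    Hypothetical helper: assumes the contents were already compared and
    that both paths are on the same filesystem (os.link fails otherwise).
    Returns the path whose inode was kept.
    """
    stat_a = os.lstat(path_a)
    stat_b = os.lstat(path_b)

    # Nothing to do if both names already refer to the same inode.
    if (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino):
        return path_a

    # Keep the inode that is already shared by more names, so that an
    # independent run over other trees can never "break off" a large group.
    if stat_a.st_nlink >= stat_b.st_nlink:
        keep, replace = path_a, path_b
    else:
        keep, replace = path_b, path_a

    # Create the link under a temporary name, then rename it over the old
    # file, so the replaced path never disappears in between.
    tmp = replace + ".hardlink-tmp"  # hypothetical suffix
    os.link(keep, tmp)
    os.replace(tmp, replace)
    return keep
```

Because the file with fewer links is always the one replaced, the
result does not depend on the order of the arguments or on which
subset of trees a given run happens to see.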

I found a reference to such an enhancement to "hardlink.py" in a
mailing list and requested the patch. Several weeks later, I received
it from its author via email. This tool was the first one which
really did what I needed.

After this odyssey, I decided that it would be a waste to use
this script only on my own system, because I had read about others with
similar problems. I have used Debian for a few years, and it has become
my favourite distribution, so I considered creating a new package or
finding somebody more experienced who would do this. So I started
asking around at "#debian-devel.de" on OFTC, and Julian Andres Klode
offered to rewrite this program and make it available in Debian.

I'm not sure whether I have explained the "maximize link count" feature
understandably, so here is one more example:

Imagine you create several identical backup trees with only small
changes between them with "rsync --link-dest". Afterwards you move a
large (measured in file size) subtree to another location in the
source. The next rsync run will copy this subtree again, but not link
it to the unchanged files of the previous run, because their
relative location has changed and rsync cannot identify them as being
the same. Now, due to memory constraints, you run the "hardlink"
utility over only the new copy and *one* of the older trees (its
immediate predecessor). With a "normal" hardlink utility, there is no
way to make sure that every equal file in the new set is indeed
replaced by a link to the older one. If you are unlucky, it will just
"break off" the predecessor tree from all the older trees and connect
it to the new one, resulting in no space gain from this operation.

Additionally, the hardlink utility gives more control over which files
are considered "equal" than any other similar program I've seen so far.
In addition to the contents, you can have it also match owner/group,
name, timestamp and mode. The only feature I have not yet found a use
for is minimizing the link count.
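
As an illustration of such configurable equality, here is a short
Python sketch. The function files_match and its keyword arguments are
hypothetical and do not reflect the actual options of the hardlink
utility; it only shows how the optional attribute checks could be
layered on top of a content comparison.

```python
import filecmp
import os


def files_match(path_a, path_b, same_owner=False, same_mode=False,
                same_mtime=False, same_name=False):
    """Return True if two files count as "equal" under the chosen criteria.

    Hypothetical helper; contents must always match, the attribute
    checks are optional.
    """
    st_a = os.lstat(path_a)
    st_b = os.lstat(path_b)

    # Files of different size can never have equal contents.
    if st_a.st_size != st_b.st_size:
        return False
    if same_owner and (st_a.st_uid, st_a.st_gid) != (st_b.st_uid, st_b.st_gid):
        return False
    if same_mode and st_a.st_mode != st_b.st_mode:
        return False
    # Compare modification times at whole-second resolution.
    if same_mtime and int(st_a.st_mtime) != int(st_b.st_mtime):
        return False
    if same_name and os.path.basename(path_a) != os.path.basename(path_b):
        return False

    # Cheap attribute checks passed; compare the contents byte by byte.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The inexpensive stat-based checks run first, so the full content
comparison is only paid for candidates that can still match.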

Regards, Christoph, who hopes to have made things a bit clearer
rather than having annoyed you with this extremely long text.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iQIcBAEBAgAGBQJJVsFuAAoJEP0O2oBXmANfFR0QAJFcLNrNfPXofyzet+QooFPa
W5WuDZOoHIwKa9v5V6JkAaJbAJdrzJUjvvQPRuGlD1cxDN0cX35LptGQXOmT75p1
Q6EHk9soPfW4oCTllmjBBia8+khjm1NMDoW9IEq68TWKLdj0Byq/Wh1GY/8DmRBR
kWqSYievqiNZRtf57SO57izgsecNbcAyNVcsO1hsjH8qCFQJgOgG3VpgQPNw40N7
pRXkjG/SsdocKOoGfhhHAlkTx7osVCXHJMBqp6bcaCE4Juav/IK+vSm2gqRhuLxK
uRRDFYbf447L4bQcgL4NQCWzQ+/j9+XCFPanjGdjnqMbbMFBDhZdkiX3PzTEZMK2
W5hX2LHtSBA4vLIC58hOQxAjWWZvjZg3/TQufNUTPriySJGrsbglLegY8xEKH0cn
RYlA2d9bs2+tdBbnSajK+O+TzTXk7WsCT7xApZAm4UrfbO5Otg4ykOry5L8uXCim
FLsjBwdowoJRIyRDrgca5PBazZs1TXR40JQUnH9qO7WkNBDJFVRnAq/yoL7KZdbI
wABDcxuVF8mpUG/jBrc8A6M8hRdFaTEyWtMMWDzRRbxDhynnKxujM2SQlt8j0ig6
Hrt/tPg3AAM3iYQ2EcONeDzIekzZW45wNlhR0HJ2qBzYVAwdrl/9ZUxF95cvAoov
ZeMf26GYtrZ+oftDN2/D
=sUFO
-----END PGP SIGNATURE-----
