Bug#662080: ITP: hadori -- Hardlinks identical files
Timo Weingärtner <timo@tiwe.de> writes:
> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-devel@lists.debian.org
>
> Package name: hadori
> Version: 0.2
> Upstream Author: Timo Weingärtner <timo@tiwe.de>
> URL: https://github.com/tiwe-de/hadori
> License: GPL3+
> Description: Hardlinks identical files
> This might look like yet another hardlinking tool, but it is the only one
> which only memorizes one filename per inode. That results in less memory
> consumption and faster execution compared to its alternatives. Therefore
> (and because all the other names are already taken) it's called
> "HArdlinking DOne RIght".
> .
> Advantages over other hardlinking tools:
> * predictability: arguments are scanned in order, each first version is kept
> * much lower CPU and memory consumption
> * hashing option: speedup on many equal-sized, mostly identical files
>
> The initial comparison was with hardlink, which got OOM killed with a hundred
> backups of my home directory. Last night I compared it to duff and rdfind
> which would have happily linked files with different st_mtime and st_mode.
>
> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug
> number.
>
>
> Greetings
> Timo
I've been thinking about the problem of memory consumption too, but I've
come to a different solution, one that doesn't need memory at all.
Instead of remembering inodes, filenames and checksums, create a global
cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
and hardlink every file to there. If you want/need to include uid, gid,
mtime, mode in there then make that part of the .cache path.
Garbage collection in the cache would be removing all files with a link
count of 1.
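The scheme above can be sketched roughly as follows. This is an untested
illustration, not hadori's code; the cache layout, the sha256 choice and
all names (CACHE, dedup, gc) are my own assumptions:

```python
import hashlib
import os

CACHE = ".cache"

def file_hash(path):
    # Full-content hash used as the cache key.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def dedup(path):
    # Hardlink every file into .cache/<start of hash>/<hash>.
    digest = file_hash(path)
    bucket = os.path.join(CACHE, digest[:2])
    os.makedirs(bucket, exist_ok=True)
    entry = os.path.join(bucket, digest)
    if os.path.exists(entry):
        # Identical file already cached: relink path to it atomically.
        tmp = path + ".tmp"
        os.link(entry, tmp)
        os.replace(tmp, path)
    else:
        # First occurrence: anchor this inode in the cache.
        os.link(path, entry)

def gc():
    # Garbage collection: a link count of 1 means only the cache
    # itself still references the inode, so the entry can go.
    for root, _dirs, files in os.walk(CACHE):
        for name in files:
            p = os.path.join(root, name)
            if os.stat(p).st_nlink == 1:
                os.unlink(p)
```

If uid, gid, mtime or mode should matter for equality, they would simply
become further path components under .cache, as described above.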
Going one step further, link files with unique size [uid, gid, mtime,
...] to .cache/<size> and change that into .cache/<size>/<start of
hash>/<hash> when you find a second file with the same size that isn't
identical. That would save on the expensive hashing of clearly unique
files.
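A rough, untested sketch of that refinement, with illustrative names
(dedup_lazy, place, relink) and sha256 standing in for whatever hash one
would actually pick:

```python
import filecmp
import hashlib
import os

CACHE = ".cache"

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def relink(entry, path):
    # Replace path with a hardlink to the cache entry, atomically.
    tmp = path + ".tmp"
    os.link(entry, tmp)
    os.replace(tmp, path)

def place(slot, path):
    # Hash-addressed placement below a per-size directory.
    digest = file_hash(path)
    bucket = os.path.join(slot, digest[:2])
    os.makedirs(bucket, exist_ok=True)
    entry = os.path.join(bucket, digest)
    if os.path.exists(entry):
        relink(entry, path)
    else:
        os.link(path, entry)

def dedup_lazy(path):
    slot = os.path.join(CACHE, str(os.path.getsize(path)))
    if not os.path.exists(slot):
        # First file of this size: remember it without hashing anything.
        os.makedirs(CACHE, exist_ok=True)
        os.link(path, slot)
    elif os.path.isfile(slot):
        if filecmp.cmp(slot, path, shallow=False):
            relink(slot, path)  # identical: just link, still no hashing
        else:
            # Second, different file of this size: promote the flat
            # .cache/<size> entry into a hash-addressed directory.
            old = slot + ".promote"
            os.rename(slot, old)
            os.makedirs(slot)
            place(slot, old)
            os.unlink(old)
            place(slot, path)
    else:
        # Slot is already a directory: hash-based placement.
        place(slot, path)
```

Files whose size never recurs are thus linked and forgotten without a
single hash computation.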
You could also use a hash that computes the first byte from the first
4k, the second byte from 64k, the third from 1 MB and so on. That way
you can check whether the beginnings of two files match without having
to checksum the whole file or literally compare the two.
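One way such a "progressive" hash could look, as an untested sketch
(the function name, the sha256 snapshots and the factor-16 growth from
4 KiB are my reading of the idea, not an existing API): one digest byte
is emitted per exponentially growing prefix, so two files that agree on
their progressive-hash prefix agree, with high probability, on the
corresponding file prefix.

```python
import hashlib

def progressive_hash(path):
    # Byte 0 summarizes the first 4 KiB, byte 1 the first 64 KiB,
    # byte 2 the first 1 MiB, and so on; a final byte covers the
    # whole file at EOF. Comparing digest prefixes thus compares
    # file prefixes without reading either file completely again.
    out = bytearray()
    h = hashlib.sha256()
    threshold = 4096
    read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(min(65536, threshold - read))
            if chunk:
                h.update(chunk)
                read += len(chunk)
            if read == threshold or not chunk:
                # Snapshot: one byte summarizing the prefix so far.
                out.append(h.copy().digest()[0])
                if not chunk:
                    break
                threshold *= 16
    return bytes(out)
```

To decide whether two candidate files can possibly be identical, one
would compare their progressive hashes byte by byte and stop at the
first mismatch.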
Regards
Goswin