Bug#662080: ITP: hadori -- Hardlinks identical files
Timo Weingärtner <timo@tiwe.de> writes:
> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-devel@lists.debian.org
>
> Package name: hadori
> Version: 0.2
> Upstream Author: Timo Weingärtner <timo@tiwe.de>
> URL: https://github.com/tiwe-de/hadori
> License: GPL3+
> Description: Hardlinks identical files
> This might look like yet another hardlinking tool, but it is the only one
> which only memorizes one filename per inode. That results in less memory
> consumption and faster execution compared to its alternatives. Therefore
> (and because all the other names are already taken) it's called
> "HArdlinking DOne RIght".
> .
> Advantages over other hardlinking tools:
> * predictability: arguments are scanned in order, each first version is kept
> * much lower CPU and memory consumption
> * hashing option: speedup on many equal-sized, mostly identical files
>
> The initial comparison was with hardlink, which got OOM killed with a hundred
> backups of my home directory. Last night I compared it to duff and rdfind
> which would have happily linked files with different st_mtime and st_mode.
>
> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug
> number.
>
>
> Greetings
> Timo
I've been thinking about the problem of memory consumption too, but I've
come to a different solution, one that doesn't need memory at all.
Instead of remembering inodes, filenames and checksums, create a global
cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
and hardlink every file to there. If you want/need to include uid, gid,
mtime, mode in there then make that part of the .cache path.
Garbage collection in the cache would be removing all files with a link
count of 1.
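The scheme above can be sketched roughly as follows. This is an untested
illustration, not hadori's code; the cache layout, the sha256 choice and
all names (CACHE, dedup, gc) are my own assumptions:

```python
import hashlib
import os

CACHE = ".cache"

def file_hash(path):
    # Full-content hash used as the cache key.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def dedup(path):
    # Hardlink every file into .cache/<start of hash>/<hash>.
    digest = file_hash(path)
    bucket = os.path.join(CACHE, digest[:2])
    os.makedirs(bucket, exist_ok=True)
    entry = os.path.join(bucket, digest)
    if os.path.exists(entry):
        # Identical file already cached: relink path to it atomically.
        tmp = path + ".tmp"
        os.link(entry, tmp)
        os.replace(tmp, path)
    else:
        # First occurrence: anchor this inode in the cache.
        os.link(path, entry)

def gc():
    # Garbage collection: a link count of 1 means only the cache
    # itself still references the inode, so the entry can go.
    for root, _dirs, files in os.walk(CACHE):
        for name in files:
            p = os.path.join(root, name)
            if os.stat(p).st_nlink == 1:
                os.unlink(p)
```

If uid, gid, mtime or mode should matter for equality, they would simply
become further path components under .cache, as described above.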
Going one step further, link files with unique size [uid, gid, mtime,
...] to .cache/<size> and change that into .cache/<size>/<start of
hash>/<hash> when you find a second file with the same size that isn't
identical. That would save on the expensive hashing of clearly unique
files.
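A rough, untested sketch of that refinement, with illustrative names
(dedup_lazy, place, relink) and sha256 standing in for whatever hash one
would actually pick:

```python
import filecmp
import hashlib
import os

CACHE = ".cache"

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def relink(entry, path):
    # Replace path with a hardlink to the cache entry, atomically.
    tmp = path + ".tmp"
    os.link(entry, tmp)
    os.replace(tmp, path)

def place(slot, path):
    # Hash-addressed placement below a per-size directory.
    digest = file_hash(path)
    bucket = os.path.join(slot, digest[:2])
    os.makedirs(bucket, exist_ok=True)
    entry = os.path.join(bucket, digest)
    if os.path.exists(entry):
        relink(entry, path)
    else:
        os.link(path, entry)

def dedup_lazy(path):
    slot = os.path.join(CACHE, str(os.path.getsize(path)))
    if not os.path.exists(slot):
        # First file of this size: remember it without hashing anything.
        os.makedirs(CACHE, exist_ok=True)
        os.link(path, slot)
    elif os.path.isfile(slot):
        if filecmp.cmp(slot, path, shallow=False):
            relink(slot, path)  # identical: just link, still no hashing
        else:
            # Second, different file of this size: promote the flat
            # .cache/<size> entry into a hash-addressed directory.
            old = slot + ".promote"
            os.rename(slot, old)
            os.makedirs(slot)
            place(slot, old)
            os.unlink(old)
            place(slot, path)
    else:
        # Slot is already a directory: hash-based placement.
        place(slot, path)
```

Files whose size never recurs are thus linked and forgotten without a
single hash computation.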
You could also use a hash that computes the first byte from the first
4k, the second byte from 64k, the third from 1 MB and so on. That way
you can check whether the beginnings of two files match without having
to checksum the whole file or literally compare the two.
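One way such a "progressive" hash could look, as an untested sketch
(the function name, the sha256 snapshots and the factor-16 growth from
4 KiB are my reading of the idea, not an existing API): one digest byte
is emitted per exponentially growing prefix, so two files that agree on
their progressive-hash prefix agree, with high probability, on the
corresponding file prefix.

```python
import hashlib

def progressive_hash(path):
    # Byte 0 summarizes the first 4 KiB, byte 1 the first 64 KiB,
    # byte 2 the first 1 MiB, and so on; a final byte covers the
    # whole file at EOF. Comparing digest prefixes thus compares
    # file prefixes without reading either file completely again.
    out = bytearray()
    h = hashlib.sha256()
    threshold = 4096
    read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(min(65536, threshold - read))
            if chunk:
                h.update(chunk)
                read += len(chunk)
            if read == threshold or not chunk:
                # Snapshot: one byte summarizing the prefix so far.
                out.append(h.copy().digest()[0])
                if not chunk:
                    break
                threshold *= 16
    return bytes(out)
```

To decide whether two candidate files can possibly be identical, one
would compare their progressive hashes byte by byte and stop at the
first mismatch.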
Regards
Goswin