
Bug#662080: ITP: hadori -- Hardlinks identical files



Julian Andres Klode <jak@debian.org> writes:

> On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
>> Timo Weingärtner <timo@tiwe.de> writes:
>> 
>> > Package: wnpp
>> > Severity: wishlist
>> > X-Debbugs-CC: debian-devel@lists.debian.org
>> >
>> >    Package name: hadori
>> >         Version: 0.2
>> > Upstream Author: Timo Weingärtner <timo@tiwe.de>
>> >             URL: https://github.com/tiwe-de/hadori
>> >         License: GPL3+
>> >     Description: Hardlinks identical files
>> >  This might look like yet another hardlinking tool, but it is the only one
>> >  which only memorizes one filename per inode. That results in less memory
>> >  consumption and faster execution compared to its alternatives. Therefore
>> >  (and because all the other names are already taken) it's called
>> >  "HArdlinking DOne RIght".
>> >  .
>> >  Advantages over other hardlinking tools:
>> >   * predictability: arguments are scanned in order, each first version is kept
>> >   * much lower CPU and memory consumption
>> >   * hashing option: speedup on many equal-sized, mostly identical files
>> >
>> > The initial comparison was with hardlink, which got OOM killed with a hundred 
>> > backups of my home directory. Last night I compared it to duff and rdfind 
>> > which would have happily linked files with different st_mtime and st_mode.
>> >
>> > I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
>> > number.
>> >
>> >
>> > Greetings
>> > Timo
>> 
>> I've been thinking about the problem of memory consumption too. But I've
>> come to a different solution. One that doesn't need memory at all.
>
> I know yet another solution. For each file you visit, you simply visit
> the complete tree. Then you have n + 1 passes over the tree, but get
> constant space usage.
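
(For illustration, a rough Python sketch of that n + 1 pass idea; the
walk_files helper and the use of filecmp are my own, not from any of the
tools discussed:)

    import filecmp
    import os

    def walk_files(root):
        for dirpath, _, names in os.walk(root):
            for name in names:
                yield os.path.join(dirpath, name)

    def link_duplicates(root):
        # For every file, walk the whole tree again looking for an
        # identical file seen earlier: n + 1 walks, constant memory.
        for a in walk_files(root):
            for b in walk_files(root):
                if b == a:
                    break                   # only files before a
                sa, sb = os.stat(a), os.stat(b)
                if (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino):
                    break                   # already hardlinked
                # A careful tool would also require equal st_mtime,
                # st_mode, uid and gid before linking.
                if sa.st_size == sb.st_size and \
                        filecmp.cmp(a, b, shallow=False):
                    os.unlink(a)
                    os.link(b, a)           # replace a with a link to b
                    break
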
>
>> 
>> Instead of remembering inodes, filenames and checksums create a global
>> cache (e.g. a directory hierarchy like .cache/<start of hash>/<hash>)
>> and hardlink every file to there. If you want/need to include uid, gid,
>> mtime, mode in there then make that part of the .cache path.
>> 
>> Garbage collection in the cache would be removing all files with a link
>> count of 1.
>> 
>> Going one step further, link files with unique size [uid, gid, mtime,
>> ...] to .cache/<size> and change that into .cache/<size>/<start of
>> hash>/<hash> when you find a second file with the same size that isn't
>> identical. That would save on the expensive hashing of clearly unique
>> files.
>
> So implement an object store and replace files outside the object
> store with hardlinks to the store. Yes, this is guaranteed to work
> for some cases, but also has problems. If you create files first, and
> then move them to the store, you still need to check every file with
> link count != 1 and check whether it is in the cache already. And for
> this, you need a lookup by inode if you want to avoid hashing.
>
> And this is basically the same hierarchy as git has:
> 	.git/objects/first 2 hex digits of sha1sum/remaining sha1sum
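
Something like this object-store layout, as an untested sketch (the
.cache root, SHA-1 and the exact way the metadata is folded into the
path are my assumptions):

    import hashlib
    import os

    def cache_path(root, path):
        # Hash the contents in chunks to avoid reading the file whole.
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        digest = h.hexdigest()
        st = os.stat(path)
        # Make uid, gid, mtime and mode part of the path so files that
        # differ only in metadata never get linked together.
        meta = '%d-%d-%d-%o' % (st.st_uid, st.st_gid,
                                st.st_mtime, st.st_mode)
        # Shard like git: first two hex digits, then the rest.
        return os.path.join(root, meta, digest[:2], digest[2:])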

In the scheme quoted above every file is in the cache. A link count of
1 would indicate a new file that hasn't been processed yet.
Unfortunately you can also have files with a link count != 1 that
aren't processed yet, e.g. two new files that are hardlinked to each
other.

Detecting whether a file is already in the cache actually needs to
check two things:

1) link count == 1
   => new file, add to cache
2) link count != 1 but hash of file not known (e.g. extended attribute
   not set)
   => new set of files that are hardlinks to each other

Actually the link count can be completely ignored if you always add a
flag when you've processed a file.
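
As a rough sketch (the attribute name is made up, and extended
attributes need filesystem support):

    import os

    XATTR = 'user.dedup.hash'       # hypothetical attribute name

    def needs_processing(path):
        # The link count is ignored: a file is new unless it carries
        # the flag we set after processing it.
        try:
            os.getxattr(path, XATTR)
            return False            # already processed
        except OSError:
            return True             # new file or new hardlink set

    def mark_processed(path, digest):
        # Storing the content hash as the flag lets it double as a
        # cheap lookup key later.
        os.setxattr(path, XATTR, digest.encode())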

Note: The above wastes time in the 2nd case since it would checksum
all the files that are hardlinks one by one and replace them with
hardlinks into the cache. But you could remember the inode and name of
the first occurrence. That would only use memory proportional to the
number of new inodes.
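
That bookkeeping could be one dictionary entry per new inode, roughly:

    import os

    seen = {}   # (st_dev, st_ino) -> name of first occurrence

    def first_occurrence(path):
        # Hardlinks share device and inode number, so later links of
        # an already-seen file are found without hashing them again.
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key in seen:
            return seen[key]    # known hardlink set, skip the checksum
        seen[key] = path
        return None             # first occurrence, needs hashing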

>> You could also use a hash that computes the first byte from the first
>> 4k, the second byte from the first 64k, the third from the first 1 MB
>> and so on. That way you can check whether the beginnings of two files
>> match without having to checksum the whole file or literally compare
>> the two.
>
> That only tells you whether the beginnings can match. They're not
> guaranteed to match just because the hashes match.

This wouldn't be to prove identity but to quickly prove difference. If
the first 4k differ then the files cannot match. It only makes sense if
you have a lot of big files of equal size.
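
A sketch of such a progressive hash (SHA-1 and the factor-16 growth
are arbitrary choices of mine):

    import hashlib

    def progressive_hash(path, levels=8):
        # One output byte per prefix: byte 0 covers the first 4k,
        # byte 1 the first 64k, byte 2 the first 1 MB and so on.
        # If two equal-sized files differ in byte k, their first
        # 4096 * 16**k bytes differ, so they cannot be identical.
        out = bytearray()
        h = hashlib.sha1()
        done, limit = 0, 4096
        with open(path, 'rb') as f:
            for _ in range(levels):
                while done < limit:
                    chunk = f.read(min(65536, limit - done))
                    if not chunk:
                        break
                    h.update(chunk)
                    done += len(chunk)
                out.append(h.digest()[0])   # digest of the data so far
                if done < limit:            # file shorter than prefix
                    break
                limit *= 16
        return bytes(out)

Comparing these strings from the front then rejects most distinct
files after reading only their first blocks.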

Regards
        Goswin


