
Bug#662080: ITP: hadori -- Hardlinks identical files



On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
> Timo Weingärtner <timo@tiwe.de> writes:
> 
> > Package: wnpp
> > Severity: wishlist
> > X-Debbugs-CC: debian-devel@lists.debian.org
> >
> >    Package name: hadori
> >         Version: 0.2
> > Upstream Author: Timo Weingärtner <timo@tiwe.de>
> >             URL: https://github.com/tiwe-de/hadori
> >         License: GPL3+
> >     Description: Hardlinks identical files
> >  This might look like yet another hardlinking tool, but it is the only one
> >  which only memorizes one filename per inode. That results in lower memory
> >  consumption and faster execution compared to its alternatives. Therefore
> >  (and because all the other names are already taken) it's called
> >  "HArdlinking DOne RIght".
> >  .
> >  Advantages over other hardlinking tools:
> >   * predictability: arguments are scanned in order, each first version is kept
> >   * much lower CPU and memory consumption
> >   * hashing option: speedup on many equal-sized, mostly identical files
> >
> > The initial comparison was with hardlink, which got OOM killed with a hundred 
> > backups of my home directory. Last night I compared it to duff and
> > rdfind, which would have happily linked files with different st_mtime
> > and st_mode.
> >
> > I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
> > number.
> >
> >
> > Greetings
> > Timo
> 
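
For reference, the approach the description above outlines fits in a
few lines. A minimal sketch (Python, not hadori's actual code; the
temporary-name suffix is an assumption): scan arguments in order,
remember one filename per inode, group candidates by the metadata that
has to match, and link every later duplicate to the first version kept.

    import filecmp
    import os
    import stat
    import sys

    # files grouped by the metadata that must match before linking
    # (device included so links stay on one filesystem); per group,
    # one remembered filename per inode
    kept = {}  # (dev, size, mode, uid, gid, mtime) -> {ino: path}

    def visit(path):
        st = os.lstat(path)
        if not stat.S_ISREG(st.st_mode):
            return
        key = (st.st_dev, st.st_size, st.st_mode,
               st.st_uid, st.st_gid, st.st_mtime)
        inodes = kept.setdefault(key, {})
        if st.st_ino in inodes:
            return  # already a hardlink to a kept file
        for kept_path in inodes.values():
            if filecmp.cmp(kept_path, path, shallow=False):
                tmp = path + ".hadori-tmp"  # hypothetical temporary name
                os.link(kept_path, tmp)
                os.replace(tmp, path)  # atomically replace the duplicate
                return
        inodes[st.st_ino] = path  # first version is kept

    for arg in sys.argv[1:]:
        for dirpath, _, names in os.walk(arg):
            for name in names:
                visit(os.path.join(dirpath, name))

Memory use is one dictionary entry per distinct inode, never one per
filename, which is where the claimed savings come from.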
> I've been thinking about the problem of memory consumption too. But I've
> come to a different solution. One that doesn't need memory at all.

I know yet another solution. For each file you visit, you simply visit
the complete tree again. Then you have n + 1 traversals, but constant
space usage.
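
A minimal sketch of that trade-off, assuming the tree does not change
between scans: n + 1 traversals, O(n^2) work overall, but no per-inode
bookkeeping at all.

    import filecmp
    import os
    import stat

    def regular_files(root):
        # sorted, so repeated walks visit files in the same order
        for dirpath, dirnames, names in os.walk(root):
            dirnames.sort()
            for name in sorted(names):
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if stat.S_ISREG(st.st_mode):
                    yield path, st

    def link_duplicates(root):
        for path, st in regular_files(root):
            # rescan from the start; only earlier files are candidates
            for other, ost in regular_files(root):
                if other == path:
                    break
                if (ost.st_dev, ost.st_ino) == (st.st_dev, st.st_ino):
                    break  # already the same inode, nothing to do
                if ((ost.st_size, ost.st_mode, ost.st_mtime)
                        == (st.st_size, st.st_mode, st.st_mtime)
                        and filecmp.cmp(other, path, shallow=False)):
                    tmp = path + ".tmp"  # hypothetical temporary name
                    os.link(other, tmp)
                    os.replace(tmp, path)
                    break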

> 
> Instead of remembering inodes, filenames and checksums create a global
> cache (e.g. directory hierarchy like .cache/<start of hash>/<hash>)
> and hardlink every file to there. If you want/need to include uid, gid,
> mtime, mode in there then make that part of the .cache path.
> 
> Garbage collection in the cache would be removing all files with a link
> count of 1.
> 
> Going one step further link files with unique size [uid, gid, mtime,
> ...] to .cache/<size> and change that into .cache/<size>/<start of
> hash>/<hash> when you find a second file with the same size that isn't
> identical. That would save on the expensive hashing of clearly unique
> files.

So implement an object store and replace files outside the object
store with hardlinks into the store. Yes, this works for some cases,
but it also has problems: if you create files first and then move them
into the store, you still need to check, for every file with link
count != 1, whether it is already in the cache. And for that you need
a lookup by inode if you want to avoid hashing.

And this is basically the same hierarchy as git has:
	.git/objects/<first 2 hex digits of sha1sum>/<remaining sha1sum>
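
A minimal sketch of such a store, following the quoted layout (the
cache root, hash choice and temporary name are assumptions, not
Goswin's code); garbage collection is exactly the link-count-1 sweep
described above:

    import hashlib
    import os

    CACHE = ".cache"  # hypothetical cache root

    def content_hash(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 16), b""):
                h.update(block)
        return h.hexdigest()

    def store(path):
        digest = content_hash(path)
        obj = os.path.join(CACHE, digest[:2], digest[2:])
        os.makedirs(os.path.dirname(obj), exist_ok=True)
        try:
            os.link(path, obj)  # first occurrence becomes the entry
        except FileExistsError:
            tmp = path + ".tmp"  # hypothetical temporary name
            os.link(obj, tmp)   # later occurrences point at the entry
            os.replace(tmp, path)

    def gc():
        # remove all cache entries with a link count of 1
        for dirpath, _, names in os.walk(CACHE):
            for name in names:
                obj = os.path.join(dirpath, name)
                if os.lstat(obj).st_nlink == 1:
                    os.unlink(obj)

Note that this sketch still hashes every file; skipping files that are
already linked into the store is exactly the lookup-by-inode problem
mentioned above.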

> 
> You could also use a hash that computes the first byte from the first
> 4k, the second byte from 64k, the third from 1 MB and so on. That way
> you can check if the beginnings of 2 files match without having to
> checksum the whole file or literally compare the two.

You can only check whether the beginnings *can* match: they're not
guaranteed to match just because the hashes match.
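
A minimal sketch of such a prefix hash, using the quoted 4k/64k/1MB
progression (a factor of 16 per byte; levels and hash are
assumptions): a differing digest byte proves the files differ within
that prefix, an equal byte proves nothing, so a full comparison is
still needed before linking.

    import hashlib

    def prefix_hash(path, levels=5, base=4096, factor=16):
        # byte i of the result summarizes the first base * factor**i
        # bytes; files shorter than a prefix just repeat the digest
        # byte of their full contents from then on
        digest = bytearray()
        h = hashlib.sha1()
        done = 0
        with open(path, "rb") as f:
            size = base
            for _ in range(levels):
                h.update(f.read(size - done))
                done = size
                digest.append(h.copy().digest()[0])  # snapshot here
                size *= factor
        return bytes(digest)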

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.


