
Re: How git performs when you throw all of Debian at it



On Sat, Aug 31, 2013 at 12:32:47AM +0100, Dmitrijs Ledkovs wrote:
> On 30 August 2013 20:55, Steven Chamberlain <steven@pyro.eu.org> wrote:
> > Hi,
> >
> >> [...] using git instead of the file system for storing the contents
> >> of Debian Code Search. The hope was that it would lead to fewer disk
> >> seeks and less data due to git's delta-encoding
> >
> > Wouldn't ZFS be a more natural way to do something like this?
> >
> > A choice of gzip, lzjb and, more recently, lz4 compression; snapshots
> > and/or deduplication both reduce the number of disk blocks and the
> > cache memory needed.
> >
> > I've pondered this overlap in functionality between Git's packing and
> > those features of the ZFS filesystem before.  They are doing much the
> > same thing, but with different granularity.  It would be neat if they
> > could work together better.
> 
> I haven't yet finished packaging bedup, a btrfs deduplication tool.

bedup is only a userspace tool that calculates per-file hashes, then uses
chattr tricks to avoid a race condition if some other process tries to
write to the file.  git makes the first part unnecessary: the hashes are
already known.  And if you're the only writer, you don't need to care
about write races either.
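
To make that concrete: out-of-band dedup on btrfs ultimately boils down
to a single kernel call.  Userspace nominates two ranges it believes are
identical (by hashing, or by trusting git), and the kernel verifies and
shares them.  Here's a minimal sketch using the FIDEDUPERANGE ioctl from
linux/fs.h, the generic dedup interface that bedup-style tools drive; the
file paths and length argument are just placeholders:

/* Minimal out-of-band dedup sketch (Linux; linux/fs.h must provide
 * FIDEDUPERANGE).  Userspace claims two ranges are identical; the
 * kernel re-verifies the bytes itself before sharing extents, so a
 * wrong guess merely returns FILE_DEDUPE_RANGE_DIFFERS. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s SRC DEST LENGTH\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);   /* target of the dedup */
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* One source range, one destination; info[] is a flexible array. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = strtoull(argv[3], NULL, 0);
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, r) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }

    if (r->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
        puts("ranges differ; nothing was shared");
    else if (r->info[0].status < 0)    /* negative errno from the kernel */
        fprintf(stderr, "dedup failed: %s\n", strerror(-r->info[0].status));
    else
        printf("%llu bytes now share extents\n",
               (unsigned long long)r->info[0].bytes_deduped);

    free(r);
    close(src);
    close(dst);
    return 0;
}

The design point worth noting: the kernel compares the data itself, under
lock, so userspace hashing only has to be good enough to nominate
candidates.  That's exactly why git already knowing the hashes removes
most of bedup's work.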

> Has anybody benchmarked that?  Is it any good and/or comparable to zfs
> deduplication?

It's an apples to microsofts comparison: zfs deduplicates in-band, which
means it keeps a massive table of block hashes in memory.  This has an
upside: duplicated data never hits the actual disk.  It also has
downsides: that memory cannot be used for anything else, and if the hash
table spills to disk, things become really slow.
With btrfs, unless you know the hash beforehand (as git does),
deduplication happens after the write.  That might be months later
(a one-shot run), during the night (cron) or, if *notify is used, a
fraction of a second later.  btrfs can also enumerate recently changed
blocks for you, so there's no need to re-read the whole disk in that
cron job.
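
To illustrate the *notify variant: a watcher sits on a directory and
nominates files for dedup as soon as a writer closes them.  (The cron
variant would instead ask btrfs for everything changed since a recorded
generation, e.g. with btrfs subvolume find-new.)  A bare-bones sketch
using the inotify API; the actual dedup step (e.g. the ioctl above) is
deliberately left out:

/* Bare-bones inotify watcher: print files that were closed after
 * writing in one directory -- the moment an out-of-band dedup pass
 * could queue them.  The dedup call itself is omitted. */
#include <limits.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s DIRECTORY\n", argv[0]);
        return 1;
    }

    int fd = inotify_init1(0);
    if (fd < 0) {
        perror("inotify_init1");
        return 1;
    }

    /* IN_CLOSE_WRITE fires once the writer is done, so the contents
     * are stable enough to hash (or dedup) right away. */
    if (inotify_add_watch(fd, argv[1], IN_CLOSE_WRITE) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0)
            break;
        /* A single read may return several packed events. */
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len)   /* name is present for events inside a dir */
                printf("dedup candidate: %s/%s\n", argv[1], ev->name);
            p += sizeof(*ev) + ev->len;
        }
    }

    close(fd);
    return 0;
}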

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

