Re: How git performs when you throw all of Debian at it
On Sat, Aug 31, 2013 at 12:32:47AM +0100, Dmitrijs Ledkovs wrote:
> On 30 August 2013 20:55, Steven Chamberlain <steven@pyro.eu.org> wrote:
> > Hi,
> >
> >> [...] using git instead of the file system for storing the contents
> >> of Debian Code Search. The hope was that it would lead to fewer disk
> >> seeks and less data thanks to git's delta encoding
> >
> > Wouldn't ZFS be a more natural way to do something like this?
> >
> > A choice of gzip, lzjb and, more recently, lz4 compression; snapshots
> > and/or deduplication both reduce the number of disk blocks and the
> > amount of cache memory needed.
> >
> > I've pondered this overlap in functionality between Git's packing and
> > those features of the ZFS filesystem before. They are doing much the
> > same thing, only at different granularity. It would be neat if they
> > could work together better.
>
> I haven't finished packaging bedup, the btrfs deduplication tool, yet.
bedup is only a userspace tool that calculates per-file hashes, then uses
chattr tricks to avoid a race condition if some other process tries to
write to the file. git makes the first part unnecessary: the hashes are
already known. And if you're the only writer, you don't need to care about
write races either.
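To illustrate the first part of what bedup does, here is a minimal sketch
(not bedup's actual code) of finding dedup candidates by hashing whole
files; the function name and chunk size are my own choices:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_files(root):
    """Group files under `root` by content hash; any group with more
    than one member is a deduplication candidate."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks to keep memory use flat.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

This whole pass is exactly what a git object store gives you for free: the
content hash is the object's name.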
> Has anybody benchmarked that? Is it any good and/or comparable to zfs
> deduplication?
It's an apples to microsofts comparison: zfs needs a massive amount of
memory to store block hashes. This has an upside: duplicated data never
hits the actual disk. And a downside: that memory cannot be used for
anything else, and if the hash table spills to disk, things become really
slow.
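To put a rough number on "massive": zfs keeps one dedup-table entry per
unique block. The ~320 bytes per in-core entry below is a commonly cited
ballpark, not an exact figure, and the function is just back-of-envelope
arithmetic:

```python
def ddt_memory_bytes(data_bytes, block_size=128 * 1024, entry_size=320):
    """Rough in-core dedup-table size: one entry per unique block,
    assuming ~320 bytes per entry (ballpark, not exact) and the
    default 128 KiB recordsize."""
    return (data_bytes // block_size) * entry_size

# 1 TiB of unique 128 KiB blocks -> roughly 2.5 GiB of table,
# and much more with smaller blocks.
```

With a smaller recordsize the table grows proportionally, which is why the
usual advice is to budget several GiB of RAM per TiB of deduped data.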
With btrfs, unless you know the hash beforehand (as git does),
deduplication happens after the write. That might be months later
(one-shot), during the night (cron) or, if *notify is used, a small
fraction of a second later. btrfs can enumerate recently changed blocks
for you, so there's no need to read the whole disk in that cron job.
--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ