Re: How git performs when you throw all of Debian at it
On Sat, Aug 31, 2013 at 12:32:47AM +0100, Dmitrijs Ledkovs wrote:
> On 30 August 2013 20:55, Steven Chamberlain <steven@pyro.eu.org> wrote:
> > Hi,
> >
> >> [...] using git instead of the file system for storing the contents
> >> of Debian Code Search. The hope was that it would lead to fewer disk
> >> seeks and less data thanks to git's delta encoding
> >
> > Wouldn't ZFS be a more natural way to do something like this?
> >
> > A choice of gzip, lzjb and, more recently, lz4 compression; snapshots
> > and/or deduplication both reduce the number of disk blocks and the
> > amount of cache memory needed.
> >
> > I've pondered this overlap in functionality between Git's packing and
> > those features of the ZFS filesystem before. They are doing much the
> > same thing, only at different granularity. It would be neat if they
> > could work together better.
>
> I haven't finished packaging bedup, the btrfs deduplication tool, yet.
bedup is only a userspace tool that calculates per-file hashes, then uses
chattr tricks to avoid a race condition if some other process tries to
write to the file. git makes the first part unnecessary: the hashes are
already known. And if you're the only writer, you don't need to care about
write races either.
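To illustrate the first part of what bedup does, here is a minimal sketch
(not bedup's actual code) of finding dedup candidates by hashing whole
files; the function name and chunk size are my own choices:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_files(root):
    """Group files under `root` by content hash; any group with more
    than one member is a deduplication candidate."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks to keep memory use flat.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

This whole pass is exactly what a git object store gives you for free: the
content hash is the object's name.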
> Has anybody benchmarked that? Is it any good and/or comparable to zfs
> deduplication?
It's an apples to microsofts comparison: zfs needs a massive amount of
memory to store block hashes. This has an upside: duplicated data never
hits the actual disk. And a downside: that memory cannot be used for
anything else, and if the hash table spills to disk, things become really
slow.
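To put a rough number on "massive": zfs keeps one dedup-table entry per
unique block. The ~320 bytes per in-core entry below is a commonly cited
ballpark, not an exact figure, and the function is just back-of-envelope
arithmetic:

```python
def ddt_memory_bytes(data_bytes, block_size=128 * 1024, entry_size=320):
    """Rough in-core dedup-table size: one entry per unique block,
    assuming ~320 bytes per entry (ballpark, not exact) and the
    default 128 KiB recordsize."""
    return (data_bytes // block_size) * entry_size

# 1 TiB of unique 128 KiB blocks -> roughly 2.5 GiB of table,
# and much more with smaller blocks.
```

With a smaller recordsize the table grows proportionally, which is why the
usual advice is to budget several GiB of RAM per TiB of deduped data.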
With btrfs, unless you know the hash beforehand (as git does),
deduplication happens after the write. That might be months later
(one-shot), during the night (cron) or, if *notify is used, a small
fraction of a second later. btrfs can enumerate recently changed blocks
for you, so there's no need to read the whole disk in that cron job.
--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ