
Re: [Debconf-discuss] btrfs deduplication / bedup users at DebConf13?



On Tue, Aug 13, 2013 at 11:04:31PM -0300, Rogério Brito wrote:
> On Tue, Aug 13, 2013 at 7:06 AM, Stefano Zacchiroli <zack@debian.org> wrote:
> > Heya,
> >   for http://sources.debian.net (and to implement pabs' crazy ideas ---
> > it's always pabs' fault!) I'm playing with btrfs deduplication [1].
> 
> Gee, I'm really sad not to be able to attend this year's DebConf, as
> deduplication is something that is very dear to my heart.
> 
> As soon as btrfs supports deduplication (offline is better for my use
> case), I will seriously consider switching.

There are two ways:
* a hacky way.  It works with existing kernels but has some quirks; I asked
  for its inclusion in #703169 (the problems are listed there).
* a nice and clean way.  The kernel interface would be: "hey kernel, I
  think the block in fd 1 at offset 0 might be the same as a block in fd 2
  at offset 4096, care to compare and perhaps combine them?" (a sketch
  follows below).  Since the kernel verifies the data itself, there are no
  problems with races or with the locking needed to avoid them, and
  userland can use hashes as short as 64 bits[1].  I think someone even
  posted a patch, but it got lost in a flamewar between developers.
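
To make that concrete, here is a sketch of what such an interface might
look like; the name and exact signature are my invention, not the patch
that was posted:

  #include <sys/types.h>

  /*
   * Hypothetical call: "care to compare and perhaps combine?".  The
   * kernel locks both ranges, compares the data byte for byte, and
   * shares the extents only on a true match, so a false positive from
   * userland's short hash costs one comparison and nothing more.
   * Returns 1 if deduplicated, 0 if the blocks differed, -1 on error.
   */
  int dedup_compare_and_share(int src_fd, off_t src_off,
                              int dst_fd, off_t dst_off, size_t len);

  /* "the block in fd 1 offset 0 might be the same as fd 2 offset 4096": */
  /* dedup_compare_and_share(fd1, 0, fd2, 4096, 4096);                   */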

> Online deduplication, at least with ZFS, was catastrophic in my experience

Online (i.e., at write time) deduplication has some benefits in theory, but
it has MASSIVE memory/disk costs.  Even with a large block size, the dedup
table takes several GB for a not-so-big filesystem.  There are so many
better ways to use that memory.

Offline (a confusing name: the filesystem stays mounted, the work just
happens at a later time) costs you a write plus a later re-read, but it can
be done at your leisure, perhaps one-shot, perhaps incrementally by saving
the hash table; it can use short hashes, and it uses no memory the rest of
the time.


[1] It could be done with a traditional hash table, but here's my algorithm:

Blocks need to be indexable by integers from a limited range.  Say, a 2TB
disk with 4KB blocks -> 2^29 slots.  A hash table with in-table chains
(+1, +i^2, double hashing, etc.) can be done with no extra memory (unlike
one with bucket lists), but it needs to know the maximum occupancy
beforehand (the number of blocks in the filesystem) and degrades once it's
more than ~80% full.  With 64-bit entries, that costs us 5GB (if you can't
afford that, use bigger blocks; that would still be much better than ZFS,
which doesn't work with files smaller than 128KB).
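
To spell out the arithmetic: 2TB / 4KB = 2^41 / 2^12 = 2^29, i.e. about
537 million blocks; at 8 bytes per entry that's 4GiB of entries, and the
~20% headroom the chains need brings it to roughly the 5GB quoted above.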

Calculate your favourite hash for every block on the disk.  It needs to
include a secret salt to prevent DoS attacks.  Take a part of the block
hash as the index into our hash table, 29 bits here.  The entry consists
of the block's number and 35 other bits of the block hash.  If the number
of blocks is not a power of two and you like to show off, you can even
store a partial bit (how: an exercise for the reader :))!  On a collision,
check those 35 bits of the hash.  If they match, you have a potential
duplicate and can tell the kernel to compare the blocks on disk (one will
already be in the page cache).  If they don't, follow the hash chain.
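
Here's a minimal sketch of the above in C, under assumptions of my own:
SLOT_BITS is shrunk to 20 so the demo table fits in 8MB rather than the
~5GB the 2TB case needs, splitmix64 stands in for "your favourite hash",
and block number 0 is reserved so an all-zero entry can mean "empty slot":

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define SLOT_BITS 20                  /* use 29 for the 2TB/4KB case   */
  #define NSLOTS    (1ULL << SLOT_BITS)

  static uint64_t *table;               /* tag<<SLOT_BITS | block number */
  static uint64_t salt = 0x5eedf00d;    /* secret in real use (anti-DoS) */

  /* Salted stand-in hash: splitmix64 folded over the block, 8 bytes at
   * a time.  A real tool would want a proper keyed hash here. */
  static uint64_t block_hash(const unsigned char *buf, size_t len)
  {
      uint64_t h = salt;
      for (size_t i = 0; i < len; i += 8) {
          uint64_t w = 0;
          memcpy(&w, buf + i, len - i < 8 ? len - i : 8);
          h += w + 0x9e3779b97f4a7c15ULL;
          h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
          h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
          h ^= h >> 31;
      }
      return h;
  }

  /* Index block `blk`.  The low SLOT_BITS of the hash pick the slot,
   * the remaining bits are the stored tag.  On a tag match, return the
   * previously seen block number so the caller can ask the kernel to
   * compare (and maybe merge) the two blocks; otherwise follow the +1
   * in-table chain.  Occupancy must stay under ~80% or chains degrade. */
  static uint64_t probe(uint64_t hash, uint64_t blk)
  {
      uint64_t slot = hash & (NSLOTS - 1);
      uint64_t tag  = hash >> SLOT_BITS;

      for (;;) {
          uint64_t cur = table[slot];
          if (cur == 0) {                       /* empty slot: claim it */
              table[slot] = (tag << SLOT_BITS) | blk;
              return UINT64_MAX;                /* no candidate found   */
          }
          if ((cur >> SLOT_BITS) == tag)        /* potential duplicate  */
              return cur & (NSLOTS - 1);
          slot = (slot + 1) & (NSLOTS - 1);     /* the "+1" chain       */
      }
  }

  int main(void)
  {
      table = calloc(NSLOTS, sizeof *table);
      if (!table)
          return 1;

      unsigned char a[4096] = "same", b[4096] = "same", c[4096] = "diff";
      printf("%llu\n", (unsigned long long)probe(block_hash(a, 4096), 1));
      printf("%llu\n", (unsigned long long)probe(block_hash(b, 4096), 2));
      printf("%llu\n", (unsigned long long)probe(block_hash(c, 4096), 3));
      /* prints UINT64_MAX, then 1 (b matched a's tag), then UINT64_MAX */
      return 0;
  }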

The number of false positives should stay acceptable until the disk size
reaches well into the hundreds of TB (with 4KB blocks).  Enlarging the
block size gives you more wiggle room (at the cost of finding fewer
duplicates).  But hey, we can use 65-bit entries!
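
Back-of-the-envelope (my numbers, not measurements): two distinct blocks
agree on all 64 used bits (29 index + 35 tag) with probability 2^-64, so a
full run expects about n^2 / 2^65 spurious block comparisons.  With
n = 2^29 blocks (2TB), that's around 2^-7, i.e. essentially never; even at
256TB (n = 2^36) it's only about 2^7 = 128 wasted comparisons over the
whole disk.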

The hash table can be saved to the disk, to be reused by an incremental run.

Another modification would be to do multiple passes, giving every pass 1/N
of the hash space and discarding hashes outside that range.  End result:
1/N the speed, but also 1/N the memory needed.
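
A sketch of the pass filter (mine; it assumes N is a power of two, with
the top bits of the hash selecting the pass):

  #include <stdint.h>

  /* Pass p of N = 2^log2n passes (log2n >= 1): index only hashes whose
   * top log2n bits equal p; the other blocks wait for their own pass. */
  static int in_this_pass(uint64_t hash, unsigned p, unsigned log2n)
  {
      return (hash >> (64 - log2n)) == p;
  }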

Using a binary tree instead of a hash table would increase the memory cost
(say, 14 bytes per entry instead of 8) but would allow saving parts of the
tree to the disk (with linear access!), allowing less computationally
demanding multi-pass runs.

See how much fun we can have with data structures?  And best of all, the
kernel needs just a single syscall, with all the complexity done in
userspace.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ
