Re: [Debconf-discuss] btrfs deduplication / bedup users at DebConf13?

To: Stefano Zacchiroli <zack@debian.org>
Cc: debconf-discuss@lists.debconf.org
Subject: Re: [Debconf-discuss] btrfs deduplication / bedup users at DebConf13?
From: Rogério Brito <rbrito@ime.usp.br>
Date: Thu, 15 Aug 2013 22:26:32 -0300
Message-id: <CAOtrxKMB1Zgjm2_9aAjjRALUFnXnvZU6CadvvHcs_O40oqrdOQ@mail.gmail.com>
In-reply-to: <20130815223535.GA26786@upsilon.cc>
References: <20130813100610.GA26614@upsilon.cc> <CAOtrxKO+t+TDq3MVy6Zo7SWS1iuQYyoY57Ju-7X7yfawU6w0fA@mail.gmail.com> <20130814230842.GA27279@angband.pl> <CAK+yWWJFXxjSJT32XMikdbMfKPn66fSzeargyKCCHobtuzZhLg@mail.gmail.com> <20130815223535.GA26786@upsilon.cc>

Hi there, Stefano.

On Thu, Aug 15, 2013 at 7:35 PM, Stefano Zacchiroli <zack@debian.org> wrote:
> On Thu, Aug 15, 2013 at 09:17:01PM +0200, Steve Schnepp wrote:
>> > There are two ways:
>>
>> I'm exploring a third way, entirely userspace (I'll draft a mixed one at
>> the end).
>>
>> For that I'm using the "hardlink" package.
>
> Oh, right! I didn't think about this, but it does in fact perfectly fit
> the sources.d.n use case (deduplication granularity is mostly at the
> file level, file only changes at every 6-hour update, etc).

I'm not Steve, but this is *much* easier than the deduplication of blocks... :)

If you only need to use this coarse deduplication, then take a look at
rdfind, instead of hardlink. Hardlink compares the files that are
likely to be the same (e.g., same size) byte by byte, while rdfind
uses hashes (md5 or sha1, at your option) to compare the files.

Of course, only use rdfind if you can tolerate the (very small) risk
of hash collisions (but then, the COW filesystems also rely on hashes
for their deduplication).
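
To make the difference between the two strategies concrete, here is a
rough Python sketch. It is not the actual code of hardlink or rdfind
(I am not reproducing their internals); the function names are made up
for illustration, and it only reports duplicate groups instead of
replacing anything with hard links. Both variants first group files by
size, then confirm candidates either by digest or byte by byte.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def group_by_size(root):
    """Map file size -> list of regular (non-symlink) files under root."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)
    return by_size


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicates_by_hash(root, algorithm="sha1"):
    """Hash-based strategy: same size and same digest counts as a duplicate."""
    groups = []
    for paths in group_by_size(root).values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path, algorithm)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)


def duplicates_by_bytes(root):
    """Byte-by-byte strategy: same size, then a full content comparison."""
    groups = []
    for paths in group_by_size(root).values():
        remaining = sorted(paths)
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            # shallow=False forces filecmp to read and compare the contents.
            same = [first] + [p for p in rest
                              if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return sorted(groups)
```

The hash variant reads each candidate file once, while the pairwise
variant may read a file several times when many files share a size;
the pairwise variant, on the other hand, cannot be fooled by a
collision.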

That being said, I usually use hardlink rather than rdfind, because I
prefer typing long options with double dashes. (OK, that's frivolous.)
:) But for the harder work, I use rdfind.

> With how many files have you tried it (max)?

I have tried it with probably far fewer files than you have (only
about 10^6 files), but for both approaches the vast majority of the
time is spent reading from (spinning) hard disks with a cold cache.

I can elaborate more on this, perhaps in a blog post, if there is interest.

Regards,

-- 
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br
