
Re: [Debconf-discuss] btrfs deduplication / bedup users at DebConf13?



On Thu, Aug 15, 2013 at 09:17:01PM +0200, Steve Schnepp wrote:
> On 15 August 2013 01:08, "Adam Borowski" <kilobyte@angband.pl> wrote:
> 
> First, some context: I'm trying to efficiently store a huge nfs-root farm
> (objective: 500+), so the performance impact must be very limited.
> 
> I'm not bothering with dedup inside files. I'm only aiming at deduping
> whole files.

> I'm exploring a third way, entirely userspace (I'll draft a mixed one at
> the end).
> 
> For that I'm using the "hardlink" package.

So you're using a filesystem with the capability you want, without actually
using it.  The "clone" ioctl makes a copy-on-write link, without the change
in semantics that hard links involve.
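
A minimal sketch of what using the clone ioctl from userspace could look
like.  The request number is the one current <linux/fs.h> calls FICLONE
(the same value btrfs exposes as BTRFS_IOC_CLONE); the helper name
clone_or_copy and the fallback to a plain copy are my own assumptions for
illustration, not part of any existing tool.

```python
import fcntl
import shutil

# Request number from <linux/fs.h>: _IOW(0x94, 9, int).
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Make dst_path a copy-on-write clone of src_path.

    Hypothetical helper: returns True if the filesystem shared the
    extents, False if it refused and we copied the bytes instead.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Called on the destination fd, with the source fd as argument.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # No clone support on this filesystem: fall back to a copy.
            shutil.copyfileobj(src, dst)
            return False
```

Either way the two names stay independent files: a later write to one does
not show up in the other, which is the difference from a hard link.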
 
> It operates at the file level, in userspace. It has some serious
> drawbacks:
> * it has a race condition when files change during the run

bedup lessens this flaw by briefly marking the file as immutable.  This means
a write attempt while the file is being linked results in a permission
failure rather than data loss.  Lower level support is needed to eliminate
races entirely.

> * once hardlinked, one cannot write to the file in place anymore. It has
> to be written as a new one, then replaced using rename(2).

You said you're on btrfs.  If for some reason you want to be filesystem
independent, you can instead go with iunlink, a hack included in the vserver
patch.  It adds a new xattr file flag that causes marked hardlinks to COW
the file when you try to write to it.  This is less efficient than btrfs:
it operates at the file rather than the block level, but works with most
filesystems.  You just need to apply a kernel patch.
 
> Despite these, it has huge benefits:
> * Nil performance impact.

Not nil; in the best case it's the same as both of the btrfs ways.  I.e.,
nil during reads, but you need to actually hash the files.

Oh, and unlike btrfs, you need to hash EVERYTHING rather than just modified
files.  And as anyone who has tried to rsync a big disk knows, even statting
is often a matter of half an hour+ on rotational disks.  A btrfs-aware tool
can enumerate changed files without having to stat anything else.
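
To make that cost concrete, here is a rough sketch of what a whole-file
userspace pass has to do on every run.  It is not how the "hardlink" package
is implemented; the function name find_duplicate_files and the choice of
SHA-256 are assumptions for illustration only.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Return lists of paths under root whose contents hash identically."""
    # Pass 1: stat every regular file and bucket by size.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: read and hash every file whose size is not unique.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[(size, digest.hexdigest())].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Pass 1 touches every inode in the tree and pass 2 reads every candidate in
full, whether or not anything changed since the previous run.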

> * Asynchronous (offline) deduplication can be scheduled off peak hours.

Same.

> * usable in old-stable kernels

Ok, that's one point.  If you insist on running an old kernel, though, you'd
still want at least iunlink so you don't risk data corruption with
hardlinks.  Because if something writes to a file that used to be identical,
the other copy is overwritten as well.
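
The hazard is easy to reproduce on any filesystem with hard links.  A small
demonstration (the function name and file names are made up for the
example): an in-place write through one name changes the other, while the
write-new-then-rename(2) pattern does not.

```python
import os
import tempfile


def demo_hardlink_hazard():
    """Return what name 'a' contains after two kinds of update to 'b'."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        with open(a, "w") as f:
            f.write("original\n")
        os.link(a, b)  # what a hardlink-based dedup leaves behind

        # In-place write through one name...
        with open(b, "r+") as f:
            f.write("CLOBBER!\n")
        with open(a) as f:
            after_in_place = f.read()  # ...is visible through the other.

        # Safe pattern: write a new file, then rename it over the name.
        tmp = b + ".tmp"
        with open(tmp, "w") as f:
            f.write("replaced\n")
        os.replace(tmp, b)
        with open(a) as f:
            after_rename = f.read()  # 'a' is untouched by the rename.

        return after_in_place, after_rename
```

Any program that opens the file for writing instead of replacing it takes
the first path, and you rarely control every such program.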

> > * a nice and clean way.  The kernel interface would need to be "hey kernel,
> >   I think the block in fd 1 offset 0 might be same as a block in fd 2 offset
> >   4096, care to compare and perhaps combine them?".
> 
> So all the cleverness of *what* to merge would only happen in userspace?

Yes, in all of the ways listed so far.

BTW, it turns out the syscall I mentioned is on its way to getting merged
after all: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/26331
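
A sketch of what the userspace side of such a call could look like,
assuming the interface takes the shape of the dedupe-range ioctl that
current <linux/fs.h> calls FIDEDUPERANGE (struct file_dedupe_range with one
destination).  The wrapper name dedupe_range and returning None when the
ioctl is unsupported are my own choices for illustration.

```python
import fcntl
import struct

# _IOWR(0x94, 54, struct file_dedupe_range) from <linux/fs.h>.
FIDEDUPERANGE = 0xC0189436

# header: src_offset, src_length, dest_count, reserved1, reserved2
# info:   dest_fd, dest_offset, bytes_deduped, status, reserved
_FMT = "=QQHHIqQQiI"


def dedupe_range(src_fd, src_off, dst_fd, dst_off, length):
    """Ask the kernel to compare two ranges and share them if identical.

    Hypothetical wrapper: returns (status, bytes_deduped) as filled in by
    the kernel, or None if the filesystem or kernel rejects the ioctl.
    """
    buf = bytearray(struct.pack(_FMT, src_off, length, 1, 0, 0,
                                dst_fd, dst_off, 0, 0, 0))
    try:
        fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel writes back into buf
    except OSError:
        return None
    fields = struct.unpack(_FMT, buf)
    return fields[8], fields[7]
```

Userspace only nominates candidates; the comparison and the actual sharing
happen in the kernel, so a wrong guess costs time rather than data.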

> What would be the impact on reads at runtime?

Nil, obviously.  That's the whole point of built-in cow.

You want noatime, obviously, but it's strange it's not the default everywhere
already.

> > Offline (a confusing name, it's a mounted filesystem but at a later time)
> 
> It can even be done asynchronously, by registering an inotify on it, and
> then queuing the dedups in a userspace daemon.

That'd effectively be online dedup, with all the usual flaws but costing you
an extra write + read compared to ZFS (assuming equally competent
implementations, which might not be true).

> > And the best of all, the kernel needs just a single syscall, with all the
> > complexity done in userspace.
> 
> That's the whole beauty of it. You then create a whole ecosystem of
> software to address that complexity in every possible way :)

The risky part is in the kernel; userspace can't destroy data here, at worst
it is less efficient than it could be.

> Now, the mixed approach I promised earlier:
> 
> As a pure userspace take is not ideal, I was thinking about adding a FUSE
> in-place layer that would synchronously copy deduped hardlinks on write.
> It could be triggered by an open(w) or a real write().

I.e., iunlink but with an extra layer of kernel<->user copying, which carries
a serious performance loss.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ
