
Re: deduplicating file systems: VDO with Debian?



On Tue, 2022-11-08 at 15:07 +0100, hede wrote:
> On 08.11.2022 05:31, hw wrote:
> > That still requires you to have enough disk space for at least two full 
> > backups.
> 
> Correct, if you always do full backups, then the second run will
> consume full backup space in the first place. (not fully correct with
> bees running -> *)

Does that work?  Does bees run as long as there's something to deduplicate and
only stop when there isn't?  I thought you start it when the data is in place
and not before that.

> That would be the first thing I'd address. Even the simplest backup
> solutions (e.g. based on rsync) make use of destination rotation and
> only submit changes to the backup (-> incremental or differential
> backups). I never considered successive full backups a backup
> "solution".

You can easily make changes to two full copies --- "make changes" meaning that
you only change what has been changed since the last time you made the backup.
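
That's essentially what the rsync --link-dest trick gives you anyway: every run
looks like a full copy, but unchanged files are hardlinked to the previous
snapshot.  A rough Python sketch of the idea (the paths are made up, and it
only compares size and mtime, so treat it as an illustration rather than a
finished tool):

#!/usr/bin/env python3
"""Create a new "full" snapshot that hardlinks unchanged files against
the previous snapshot, so only changed files take additional space."""
import os
import shutil

SRC = "/home/data"            # hypothetical source tree
PREV = "/backup/2022-11-07"   # previous snapshot (may not exist yet)
NEW = "/backup/2022-11-08"    # snapshot being created now

def unchanged(src, prev):
    """Treat a file as unchanged if size and mtime match the old snapshot."""
    try:
        a, b = os.stat(src), os.stat(prev)
    except FileNotFoundError:
        return False
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

for root, dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    os.makedirs(os.path.join(NEW, rel), exist_ok=True)
    for name in files:
        src = os.path.join(root, name)
        prev = os.path.join(PREV, rel, name)
        dst = os.path.join(NEW, rel, name)
        if unchanged(src, prev):
            os.link(prev, dst)       # unchanged: hardlink, no extra space
        else:
            shutil.copy2(src, dst)   # changed or new: real copy, mtime kept

Every snapshot then presents itself as a full backup, but unchanged files exist
on disk only once.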

> For me only the first backup is a full backup, every other backup is 
> incremental.

When you make a second full backup, that second copy is not incremental.  It's a
full backup.

> Regarding deduplication, I do see benefits when the user moves files
> from one directory to some other directory, with partly changed files
> (my backup solution dedupes on a file basis via hardlinks only), and
> with system backups of several different machines.

But not with copies?
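
A content-hash based variant could catch copies and moves as well; in principle
file-level hardlink dedup is just something like this (a quick Python sketch,
made-up path, and it ignores ownership/permission differences):

#!/usr/bin/env python3
"""Replace files with identical content by hardlinks to a single copy.
Whole-file duplicates only; partly changed files are not touched."""
import hashlib
import os

TREE = "/backup"   # hypothetical tree to deduplicate

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.digest()

seen = {}  # content hash -> path of the first copy found
for root, dirs, files in os.walk(TREE):
    for name in files:
        path = os.path.join(root, name)
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        digest = file_hash(path)
        if digest in seen and not os.path.samefile(path, seen[digest]):
            os.unlink(path)
            os.link(seen[digest], path)   # duplicate: point at the first copy
        else:
            seen.setdefault(digest, path)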

> I prefer file based backups, so my backup solution's deduplication
> skills are really limited. But a good block based backup solution can
> handle all these cases by itself. Then no filesystem based
> deduplication is needed.

What difference does it make whether the deduplication is block based or somehow
file based (whatever that means)?

> If your problem is only backup related and you are flexible regarding
> your backup solution, then choosing a backup solution with a good
> deduplication feature is probably your best choice. The solution
> doesn't have to be complex. Even simple backup solutions like borg
> backup are fine here (borg: chunk based deduplication, even of parts of
> files, across several backups of several different machines). Even your
> criterion of not writing duplicate data in the first place is fulfilled
> here.

I'm flexible, but I distrust "backup solutions".
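
The chunk-based idea itself is simple enough to reason about, at least in
principle.  A toy Python sketch (not borg's actual implementation: borg uses
content-defined chunk boundaries, compression and encryption; fixed-size chunks
here just keep it short):

#!/usr/bin/env python3
"""Store each chunk only once, keyed by its hash; files are recorded as
lists of chunk references.  Duplicate data across files and across
backup runs then costs only a reference, not space."""
import hashlib

CHUNK = 4 * 1024 * 1024   # fixed 4 MiB chunks (a simplification)
store = {}                # chunk hash -> chunk data   (the "repository")
index = {}                # file path  -> chunk hashes (the "archive")

def backup_file(path):
    hashes = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h = hashlib.sha256(chunk).digest()
            if h not in store:        # new data: stored exactly once
                store[h] = chunk
            hashes.append(h)          # duplicates cost only a reference
    index[path] = hashes

def restore_file(path, out):
    with open(out, "wb") as f:
        for h in index[path]:
            f.write(store[h])

So a second "full" backup of mostly unchanged data adds almost nothing to the
store, which is also how duplicate data avoids being written twice in the first
place.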

> (see borgbackup in the Debian repository; disclaimer: I do not have
> personal experience with borg as I'm using other solutions)
> 
> > I wouldn't mind running it from time to time, though I don't know that I
> > would have a lot of duplicate data other than backups.  How much space
> > might I expect to gain from using bees, and how much memory does it
> > require to run?
> 
> Bees should run as a service 24/7; it catches all written data right
> after it gets written. That's comparable to in-band deduplication even
> if it's out-of-band by definition. (*) This way, writing many duplicate
> files will potentially result in removing duplicates even before all
> the data has been written to disk.
> 
> Therefore memory consumption is also like with in-band deduplication
> (ZFS...), which means you should reserve more than 1 GB RAM per 1 TB of
> data. But it's flexible. Even less memory is usable, but then it cannot
> find all duplicates, as the hash table of all the data doesn't fit into
> memory. (Nevertheless, even then deduplication is more efficient than
> expected: if it finds some duplicate block, it also looks at the blocks
> around that block. So for big files, only one match in the hash table
> is sufficient to deduplicate the whole file.)

Sounds good.  Before I try it, I need to make a backup in case something goes
wrong.
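
If I understand the hash table idea correctly, it's roughly this (a toy Python
sketch, not bees' real code; the real thing works on btrfs extents and shares
them via the kernel's dedupe ioctl, while this one only prints what it would
deduplicate):

#!/usr/bin/env python3
"""Keep a bounded hash table of blocks already seen.  On a hit, compare
the blocks following the match to grow it into a larger duplicate
extent, so one table entry can be enough to cover a whole file."""
import hashlib
from collections import OrderedDict

BLOCK = 4096
MAX_ENTRIES = 1_000_000   # bounds memory use, like a small bees hash table
table = OrderedDict()     # block hash -> (path, offset), oldest dropped first

def read_block(path, off):
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(BLOCK)

def scan(path):
    off = 0
    while True:
        block = read_block(path, off)
        if not block:
            break
        h = hashlib.sha256(block).digest()
        if h in table and table[h] != (path, off):
            other_path, other_off = table[h]
            length = BLOCK
            # Extend the match forward: one hash hit can cover a whole
            # duplicated file this way.
            while True:
                a = read_block(path, off + length)
                b = read_block(other_path, other_off + length)
                if not a or a != b:
                    break
                length += BLOCK
            print(f"dup: {path}@{off} == {other_path}@{other_off}, {length} bytes")
            off += length
        else:
            table[h] = (path, off)
            if len(table) > MAX_ENTRIES:
                table.popitem(last=False)   # forget the oldest entry
            off += BLOCK

# e.g.: call scan(path) for every file in the tree

A table that small would of course miss a lot, which is why the quoted figure
of more than 1 GB RAM per 1 TB of data exists; but as described above, a single
hit on a big duplicated file is enough to find the rest by extension.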

