
Re: deduplicating file systems: VDO with Debian?



On Tue, 2022-11-08 at 15:07 +0100, hede wrote:
> On 08.11.2022 05:31, hw wrote:
> > That still requires you to have enough disk space for at least two full 
> > backups.
> 
> Correct, if you always do full backups, then the second run will
> consume full backup space in the first place. (not fully correct with
> bees running -> *)

Does that work?  Does bees run as long as there's something to deduplicate and
only stop when there isn't?  I thought you start it when the data is in place
and not before that.

> That would be the first thing I'd address. Even the simplest backup
> solutions (e.g. based on rsync) make use of destination rotation and
> only submit changes to the backup (-> incremental or differential
> backups). I never considered successive full backups a backup
> "solution".

You can easily make changes to two full copies --- "make changes" meaning that
you only change what has been changed since the last time you made the backup.
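
That's essentially what the rsync --link-dest trick gives you anyway: every run
looks like a full copy, but unchanged files are hardlinked to the previous
snapshot.  A rough Python sketch of the idea (the paths are made up, and it
only compares size and mtime, so treat it as an illustration rather than a
finished tool):

#!/usr/bin/env python3
"""Create a new "full" snapshot that hardlinks unchanged files against
the previous snapshot, so only changed files take additional space."""
import os
import shutil

SRC = "/home/data"            # hypothetical source tree
PREV = "/backup/2022-11-07"   # previous snapshot (may not exist yet)
NEW = "/backup/2022-11-08"    # snapshot being created now

def unchanged(src, prev):
    """Treat a file as unchanged if size and mtime match the old snapshot."""
    try:
        a, b = os.stat(src), os.stat(prev)
    except FileNotFoundError:
        return False
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

for root, dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    os.makedirs(os.path.join(NEW, rel), exist_ok=True)
    for name in files:
        src = os.path.join(root, name)
        prev = os.path.join(PREV, rel, name)
        dst = os.path.join(NEW, rel, name)
        if unchanged(src, prev):
            os.link(prev, dst)       # unchanged: hardlink, no extra space
        else:
            shutil.copy2(src, dst)   # changed or new: real copy, mtime kept

Every snapshot then presents itself as a full backup, but unchanged files exist
on disk only once.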

> For me only the first backup is a full backup, every other backup is 
> incremental.

When you make a second full backup, that second copy is not incremental.  It's a
full backup.

> Regarding deduplication, I do see benefits when the user moves files
> from one directory to some other directory, with partly changed files
> (my backup solution dedupes on a file basis via hardlinks only), and
> with system backups of several different machines.

But not with copies?
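
A content-hash based variant could catch copies and moves as well; in principle
file-level hardlink dedup is just something like this (a quick Python sketch,
made-up path, and it ignores ownership/permission differences):

#!/usr/bin/env python3
"""Replace files with identical content by hardlinks to a single copy.
Whole-file duplicates only; partly changed files are not touched."""
import hashlib
import os

TREE = "/backup"   # hypothetical tree to deduplicate

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.digest()

seen = {}  # content hash -> path of the first copy found
for root, dirs, files in os.walk(TREE):
    for name in files:
        path = os.path.join(root, name)
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        digest = file_hash(path)
        if digest in seen and not os.path.samefile(path, seen[digest]):
            os.unlink(path)
            os.link(seen[digest], path)   # duplicate: point at the first copy
        else:
            seen.setdefault(digest, path)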

> I prefer file based backups, so my backup solution's deduplication
> skills are really limited. But a good block based backup solution can
> handle all these cases by itself. Then no filesystem based
> deduplication is needed.

What difference does it make whether the deduplication is block based or somehow
file based (whatever that means)?

> If your problem is only backup related and you are flexible regarding
> your backup solution, then choosing a backup solution with a good
> deduplication feature is probably your best choice. The solution
> doesn't have to be complex. Even simple backup solutions like borg
> backup are fine here (borg: chunk based deduplication, even of parts of
> files, across several backups of several different machines). Even your
> criterion of not writing duplicate data in the first place is fulfilled
> here.

I'm flexible, but I distrust "backup solutions".
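
The chunk-based idea itself is simple enough to reason about, at least in
principle.  A toy Python sketch (not borg's actual implementation: borg uses
content-defined chunk boundaries, compression and encryption; fixed-size chunks
here just keep it short):

#!/usr/bin/env python3
"""Store each chunk only once, keyed by its hash; files are recorded as
lists of chunk references.  Duplicate data across files and across
backup runs then costs only a reference, not space."""
import hashlib

CHUNK = 4 * 1024 * 1024   # fixed 4 MiB chunks (a simplification)
store = {}                # chunk hash -> chunk data   (the "repository")
index = {}                # file path  -> chunk hashes (the "archive")

def backup_file(path):
    hashes = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h = hashlib.sha256(chunk).digest()
            if h not in store:        # new data: stored exactly once
                store[h] = chunk
            hashes.append(h)          # duplicates cost only a reference
    index[path] = hashes

def restore_file(path, out):
    with open(out, "wb") as f:
        for h in index[path]:
            f.write(store[h])

So a second "full" backup of mostly unchanged data adds almost nothing to the
store, which is also how duplicate data avoids being written twice in the first
place.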

> (see borgbackup in the Debian repository; disclaimer: I do not have
> personal experience with borg as I'm using other solutions)
> 
> > I wouldn't mind running it from time to time, though I don't know that I
> > would have a lot of duplicate data other than backups.  How much space
> > might I expect to gain from using bees, and how much memory does it
> > require to run?
> 
> Bees should run as a service 24/7; it catches all written data right
> after it gets written. That's comparable to in-band deduplication even
> if it's out-of-band by definition. (*) This way, writing many duplicate
> files will potentially result in removing duplicates even before all
> the data has been written to disk.
> 
> Therefore memory consumption is also like with in-band deduplication
> (ZFS...), which means you should reserve more than 1 GB RAM per 1 TB of
> data. But it's flexible. Even less memory is usable, but then it cannot
> find all duplicates, as the hash table of all the data doesn't fit into
> memory. (Nevertheless, even then deduplication is more efficient than
> expected: if it finds some duplicate block, it also looks at the blocks
> around that block. So for big files, only one match in the hash table
> is sufficient to deduplicate the whole file.)

Sounds good.  Before I try it, I need to make a backup in case something goes
wrong.
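
If I understand the hash table idea correctly, it's roughly this (a toy Python
sketch, not bees' real code; the real thing works on btrfs extents and shares
them via the kernel's dedupe ioctl, while this one only prints what it would
deduplicate):

#!/usr/bin/env python3
"""Keep a bounded hash table of blocks already seen.  On a hit, compare
the blocks following the match to grow it into a larger duplicate
extent, so one table entry can be enough to cover a whole file."""
import hashlib
from collections import OrderedDict

BLOCK = 4096
MAX_ENTRIES = 1_000_000   # bounds memory use, like a small bees hash table
table = OrderedDict()     # block hash -> (path, offset), oldest dropped first

def read_block(path, off):
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(BLOCK)

def scan(path):
    off = 0
    while True:
        block = read_block(path, off)
        if not block:
            break
        h = hashlib.sha256(block).digest()
        if h in table and table[h] != (path, off):
            other_path, other_off = table[h]
            length = BLOCK
            # Extend the match forward: one hash hit can cover a whole
            # duplicated file this way.
            while True:
                a = read_block(path, off + length)
                b = read_block(other_path, other_off + length)
                if not a or a != b:
                    break
                length += BLOCK
            print(f"dup: {path}@{off} == {other_path}@{other_off}, {length} bytes")
            off += length
        else:
            table[h] = (path, off)
            if len(table) > MAX_ENTRIES:
                table.popitem(last=False)   # forget the oldest entry
            off += BLOCK

# e.g.: call scan(path) for every file in the tree

A table that small would of course miss a lot, which is why the quoted figure
of more than 1 GB RAM per 1 TB of data exists; but as described above, a single
hit on a big duplicated file is enough to find the rest by extension.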

