Re: deduplicating file systems: VDO with Debian?



On 08.11.2022 05:31, hw wrote:
> That still requires you to have enough disk space for at least two full backups.

Correct: if you always do full backups, the second run will initially consume the space of another full backup. (Not entirely correct with bees running -> *)

That would be the first thing I'd address. Even the simplest backup solutions (e.g. based on rsync) make use of destination rotation and only transfer changes to the backup (-> incremental or differential backups). I never considered successive full backups a backup "solution".
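
To illustrate that rotation idea, here is a minimal Python sketch built around rsync's --link-dest option (the paths and directory layout are placeholder assumptions for the example, not my actual setup; rsync must be installed):

#!/usr/bin/env python3
# Minimal sketch: rotated, hardlink-based incremental backups via rsync.
# Paths are placeholders; the destination directory must already exist.
import datetime
import pathlib
import subprocess

SOURCE = "/home/"                      # what to back up (trailing slash: copy contents)
DEST = pathlib.Path("/backup/host1")   # backup destination

today = DEST / datetime.date.today().isoformat()
latest = DEST / "latest"               # symlink to the previous snapshot

cmd = ["rsync", "-a", "--delete"]
if latest.exists():
    # Unchanged files become hardlinks into the previous snapshot,
    # so only changed data consumes additional space.
    cmd.append(f"--link-dest={latest.resolve()}")
cmd += [SOURCE, str(today)]

subprocess.run(cmd, check=True)

# Point "latest" at the snapshot we just made.
if latest.is_symlink():
    latest.unlink()
latest.symlink_to(today.name)

The point is only that every run looks like a full backup in the destination, while unchanged files are hardlinks into the previous snapshot and only changed data takes up new space.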

For me, only the first backup is a full backup; every subsequent backup is incremental.

Regarding deduplication, I see benefits when the user moves files from one directory to another, with partly changed files (my backup solution dedupes on a per-file basis via hardlinks only), and with system backups of several different machines.
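
To show what file-based deduplication via hardlinks means, here is a simplified sketch (not the tool I actually use; it assumes everything lives on a single filesystem and it ignores file metadata):

#!/usr/bin/env python3
# Simplified sketch: replace identical files with hardlinks to one copy.
# Works only within a single filesystem; compares files by content hash.
import hashlib
import os
import sys

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen = {}  # content hash -> path of first occurrence
for root, _dirs, files in os.walk(sys.argv[1]):
    for name in files:
        path = os.path.join(root, name)
        if os.path.islink(path):
            continue
        digest = file_hash(path)
        if digest in seen:
            # Duplicate content: replace this copy with a hardlink.
            os.unlink(path)
            os.link(seen[digest], path)
        else:
            seen[digest] = path

This obviously only helps with whole-file duplicates; partly changed files still get stored twice, which is exactly where block- or chunk-based deduplication is better.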

I prefer file-based backups, so my backup solution's deduplication capabilities are quite limited. But a good block-based backup solution can handle all these cases by itself; then no filesystem-based deduplication is needed.

If your problem is only backup-related and you are flexible regarding your backup solution, then choosing a backup solution with a good deduplication feature is probably your best option. The solution doesn't have to be complex. Even simple backup solutions like borg are fine here (borg: chunk-based deduplication, even of parts of files, across several backups of several different machines). Even your criterion of not writing duplicate data in the first place is fulfilled here.

(see borgbackup in the Debian repository; disclaimer: I have no personal experience with borg as I'm using other solutions)
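
To make the chunk idea a bit more concrete, here is a toy sketch of the principle only; borg's real chunker, index and repository format are of course far more involved (it also uses content-defined rather than fixed-size chunks):

#!/usr/bin/env python3
# Toy illustration of chunk-based deduplication: split files into chunks,
# store each unique chunk once, reference chunks by hash in a manifest.
import hashlib
import sys

CHUNK_SIZE = 64 * 1024  # 64 KiB, arbitrary for this example

store = {}      # chunk hash -> chunk bytes ("repository")
manifests = {}  # file name  -> list of chunk hashes ("archive metadata")

for path in sys.argv[1:]:
    refs = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # identical chunks stored only once
            refs.append(digest)
    manifests[path] = refs

stored = sum(len(c) for c in store.values())
referenced = sum(len(store[d]) for refs in manifests.values() for d in refs)
print(f"unique data stored: {stored} bytes, logical data: {referenced} bytes")

Because duplicates are detected before anything is written to the repository, duplicate data never hits the backup disk in the first place.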

> I wouldn't mind running it from time to time, though I don't know that I would have a lot of duplicate data other than backups. How much space might I expect to gain from using bees, and how much memory does it require to run?

Bees should run as a service 24/7 and catch all written data right after it gets written. That's comparable to in-band deduplication, even though it is out-of-band by definition. (*) This way, writing many duplicate files will potentially result in duplicates being removed even before all the data has been written to disk.

Therefore its memory consumption is also similar to that of in-band deduplication (ZFS...), which means you should reserve more than 1 GB of RAM per 1 TB of data. But it's flexible: even less memory is usable, although then it cannot find all duplicates because the hash table for all the data doesn't fit into memory. (Nevertheless, even then deduplication is more efficient than you might expect: if bees finds a duplicate block, it also looks at the blocks around it, so for big files a single match in the hash table is enough to deduplicate the whole file.)
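
As a rough, purely arithmetic illustration of that rule of thumb (the numbers are assumptions for the example; check the bees documentation for real sizing guidance):

#!/usr/bin/env python3
# Rough sizing illustration for the "roughly 1 GB of hash table per 1 TB of
# data" rule of thumb mentioned above. Pure arithmetic, no bees API used.
GB = 1024 ** 3
TB = 1024 ** 4

data_size = 4 * TB        # example: 4 TB filesystem
hash_table = 1 * GB       # what you are willing to give bees

recommended = max(1, data_size // TB) * GB
coverage = min(1.0, hash_table / recommended)

print(f"recommended hash table: ~{recommended / GB:.0f} GB")
print(f"with {hash_table / GB:.0f} GB you can index roughly "
      f"{coverage:.0%} of the data's blocks")

So with 1 GB for a 4 TB filesystem, only about a quarter of the blocks fit into the hash table, which is why some duplicates may be missed, while the "look at surrounding blocks" behaviour still catches most large duplicate files.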

regards
hede

