Re: deduplicating file systems: VDO with Debian?
On 08.11.2022 05:31, hw wrote:
> That still requires you to have enough disk space for at least two full
> backups.
Correct, if you always do full backups, then the second run will consume
a full backup's worth of space. (Not fully correct with bees running -> *)
That would be the first thing I'd address. Even the simplest backup
solutions (e.g. based on rsync) make use of destination rotation and
transmit only the changes to the backup (-> incremental or differential
backups). I have never considered successive full backups a backup
"solution".
For me only the first backup is a full backup; every subsequent backup is
incremental.
Regarding deduplication, I see benefits either when the user moves files
from one directory to another, with partly changed files (my backup
solution dedupes on a per-file basis via hardlinks only), or with system
backups of several different machines. I prefer file-based backups, so my
backup solution's deduplication skills are really limited. But a good
block-based backup solution can handle all these cases by itself; then no
filesystem-based deduplication is needed.
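A file-level, hardlink-only deduplication pass like the one I use can be
sketched in a few lines of Python (a minimal illustration of the idea, not
any particular tool; it compares whole-file hashes only, so partly changed
files are never deduplicated):

```python
import hashlib
import os

def dedupe_hardlink(root):
    """Replace duplicate files under `root` with hardlinks to one copy.

    Files count as duplicates when their SHA-256 digests match, i.e.
    file-level (not block-level) deduplication.
    """
    seen = {}  # digest -> path of the first file with that content
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                os.remove(path)              # drop the duplicate ...
                os.link(seen[digest], path)  # ... and hardlink the kept copy
            else:
                seen[digest] = path
```

After running it, duplicate files share one inode, so they occupy the
space of a single copy.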
If your problem is only backup related and you are flexible regarding
your backup solution, then choosing a backup solution with a good
deduplication feature is probably your best choice. The solution doesn't
have to be complex. Even simple backup solutions like borg are fine here
(borg: chunk-based deduplication, even of parts of files, across several
backups of several different machines). Even your criterion of not
writing duplicate data in the first place is fulfilled here.
(see borgbackup in Debian repository; disclaimer: I do not have personal
experience with borg as I'm using other solutions)
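The chunk-based idea behind borg can be shown with a toy content-addressed
store (a sketch of the principle only: borg actually uses variable-size,
content-defined chunking plus compression and encryption, while this uses
fixed-size chunks for brevity):

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; borg uses content-defined ones

def store_backup(data, chunk_store):
    """Split `data` into chunks; store each distinct chunk only once.

    Returns the list of chunk hashes needed to reconstruct `data`.
    A second backup of mostly identical data adds almost no new chunks,
    so duplicate data is never written in the first place.
    """
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(key, chunk)  # write only if unseen
        refs.append(key)
    return refs

def restore_backup(refs, chunk_store):
    """Reassemble the original data from its chunk references."""
    return b"".join(chunk_store[k] for k in refs)
```

Backing up the same data twice leaves the store unchanged; only the small
list of references is new, which is what makes repeated "full" backups of
several machines cheap.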
> I wouldn't mind running it from time to time, though I don't know that
> I would have a lot of duplicate data other than backups. How much space
> might I expect to gain from using bees, and how much memory does it
> require to run?
Bees should run as a service 24/7 and catches all written data right
after it is written. That is comparable to in-band deduplication, even
though it is out-of-band by definition. (*) This way, writing many
duplicate files will potentially result in duplicates being removed
before all the data has even been written to disk.
Therefore memory consumption is also like that of in-band deduplication
(ZFS...), which means you should reserve more than 1 GB of RAM per 1 TB
of data. But it's flexible: even less memory is usable, though then it
cannot find all duplicates, as the hash table for all the data doesn't
fit into memory. (Nevertheless, even then deduplication is more efficient
than you'd expect: when bees finds a duplicate block, it also examines the
blocks around it, so for big files a single match in the hash table is
sufficient to deduplicate the whole file.)
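To get a feel for the memory/coverage trade-off, here is a
back-of-the-envelope calculation (the 16-byte entry size and 4 KiB block
size are illustrative assumptions, not bees' actual internal format):

```python
def indexable_data(table_bytes, entry_bytes=16, block_bytes=4096):
    """Roughly how much data a hash table of `table_bytes` can cover,
    assuming one `entry_bytes` entry per `block_bytes` block of data.

    The ratio is simply block_bytes / entry_bytes; with less RAM, only
    a fraction of all blocks gets an entry, so some duplicates are
    missed (mitigated in practice by scanning around each match).
    """
    return (table_bytes // entry_bytes) * block_bytes

GIB = 1024 ** 3
# Under these assumed sizes, a 1 GiB table fully indexes 256 GiB of data.
print(indexable_data(1 * GIB) // GIB)
```

The exact numbers depend entirely on the assumed entry and block sizes;
the point is only that coverage scales linearly with the table size.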
regards
hede