
Re: Backup Times on a Linux desktop



deloptes writes:

Alessandro Baggi wrote:

> Borg seems very promising but it performs only push requests at the moment
> and I need pull requests. It offers deduplication, encryption and much
> more.
>
> One word on deduplication: it is a great feature to save space; with
> deduplication, compression operations (which could require much time) are
> avoided. But remember that with deduplication, only one copy of a file is
> stored across multiple backups. So if this copy gets corrupted (for
> whatever reason), it is compromised in all previous backup jobs as well,
> and the file is lost. For this reason I try to avoid deduplication on
> important backup datasets.

Not sure if that is true - for example, you make daily, weekly and monthly
backups (classical). Let's focus on the daily part. On day 3 the file is
broken. You have to recover from day 2. The file is not broken for day 2 -
correct?!

[...]

I'd argue that you are both right about this. It just depends on where the
file corruption occurs.

Consider a deduplicated system which stores its backups in /fs/backup and
reads the input files from /fs/data. If a file in /fs/data gets corrupted,
you can still extract an intact copy from the backup. Since the corruption
changed the file's content, the backup system no longer considers it a
"duplicate" and stores the corrupted content as a new version alongside the
old one. Effectively, while the newest version of the file is corrupted and
thus not useful, the old version remains recoverable from the backup,
deduplicated or not.
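
To illustrate that first case, here is a toy sketch of content-addressed
deduplication in Python (not how borg or any other real tool lays out its
repository): each blob is keyed by its SHA-256 digest, so the corrupted copy
simply becomes a new, different object and the old one stays retrievable.

import hashlib

class DedupStore:
    """Toy content-addressed store: one blob per unique content hash."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> bytes
        self.snapshots = {}  # snapshot name -> {path: digest}

    def backup(self, name, files):
        index = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # dedup: store only once
            index[path] = digest
        self.snapshots[name] = index

    def restore(self, name, path):
        return self.blobs[self.snapshots[name][path]]

store = DedupStore()
store.backup("day1", {"report.txt": b"important data"})
# On day 2 the source file got corrupted; its hash differs, so the
# corrupted content is stored as a *new* blob next to the old one.
store.backup("day2", {"report.txt": b"imp\x00rtant data"})

assert store.restore("day1", "report.txt") == b"important data"    # intact
assert store.restore("day2", "report.txt") == b"imp\x00rtant data" # corrupted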

The other consideration is corruption on the backup storage volume itself,
i.e. some files in /fs/backup going bad. In a deduplicated setting, if a
single piece of data in /fs/backup backs a lot of restored files with the
same contents, none of these files can be recovered successfully any more,
because the backup's internal structure now contains corrupted data.
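
A hypothetical toy example of that "blast radius": three identical files
share one blob in the store; corrupt that single blob and all three fail
their integrity check on restore, whereas a non-deduplicated layout would
only have lost one copy.

import hashlib

# Toy deduplicated store: identical contents share a single blob.
blobs = {}   # digest -> bytes
index = {}   # file path -> digest of its content

for path in ("a/letter.txt", "b/copy.txt", "c/another-copy.txt"):
    data = b"the same letter, stored three times"
    digest = hashlib.sha256(data).hexdigest()
    blobs[digest] = data        # stored only once for all three paths
    index[path] = digest

# Simulate corruption of that single blob on the backup volume.
shared = index["a/letter.txt"]
blobs[shared] = b"XX" + blobs[shared][2:]

# Every file that referenced the shared blob is now unrecoverable.
damaged = [p for p, d in index.items()
           if hashlib.sha256(blobs[d]).hexdigest() != d]
print(damaged)   # all three paths fail their integrity check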

In a non-deduplicated (so to speak: redundant) backup system, if parts of
the backup store become corrupted, the damage upon restoration is likely
(but not necessarily) restricted to only some files. As there is no
deduplication, the amount of data that cannot be restored is roughly
proportional to the amount of data corrupted...

As these considerations about a corrupted backup store remain on such a
blurry level as described, the benefit of avoiding deduplication because of
the risk of losing more files upon corruption of the backup store is
probably limited. However, for a concrete system the picture might change
entirely. A basic file-based backup (e.g. rsync) is as tolerant to
corruption as the original "naked" files. For any system maintaining its own
repository format, that system needs to be studied in detail to find out how
partial corruption affects restorability. In theory, it could carry
additional redundancy data to restore files even in the presence of a
certain level of corruption (e.g. expressed in percent of bytes changed or
similar).
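
None of the tools mentioned here necessarily does this, but as a sketch of
what such redundancy data could look like, the simplest possible scheme is a
single XOR parity block over a group of equally sized archive blocks
(RAID-4 style), which lets you reconstruct any one lost block of the group:

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                        blocks))

# Three equally sized archive blocks plus one parity block.
blocks = [b"archive-01!!", b"archive-02!!", b"archive-03!!"]
parity = xor_blocks(blocks)

# Block 1 is lost/corrupted; XOR of the survivors and the parity restores it.
recovered = xor_blocks([blocks[0], blocks[2], parity])
assert recovered == blocks[1]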

This whole thing was actually a reason for writing my own system: file-based
rsync backups were slow, space-inefficient and did not provide encryption.
However, more advanced systems (like borg, obnam?) split files into multiple
chunks and maintain their own repository format. For me it is not really
obvious how a partially corrupted backup restores with these systems. For my
tool, I chose an approach in between: I store only "whole" files and do not
deduplicate them in any way. However, I put multiple small files into
archives such that I can compress and encrypt them. In my case, a partial
corruption loses exactly the files from the corrupted archives, which
establishes a relation between the amount of data corrupted and the amount
lost (although in the worst case, "each archive slightly corrupted", all is
lost... to avoid that one needs error correction, but my tool does not do it
[yet?])
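
Just to sketch the idea (this is not my tool's actual code, only a Python
illustration and without the encryption part): group small files into
size-bounded .tar.gz archives, then a corrupted archive costs exactly the
files that were packed into it.

import os
import tarfile

def pack_into_archives(paths, out_dir, max_bytes=64 * 1024 * 1024):
    """Group small files into consecutive .tar.gz archives of bounded
    input size."""
    os.makedirs(out_dir, exist_ok=True)
    group, group_size, archive_no = [], 0, 0
    for path in sorted(paths):
        size = os.path.getsize(path)
        if group and group_size + size > max_bytes:
            _write_archive(group, out_dir, archive_no)
            group, group_size, archive_no = [], 0, archive_no + 1
        group.append(path)
        group_size += size
    if group:
        _write_archive(group, out_dir, archive_no)

def _write_archive(paths, out_dir, archive_no):
    name = os.path.join(out_dir, f"backup-{archive_no:05d}.tar.gz")
    with tarfile.open(name, "w:gz") as tar:
        for path in paths:
            tar.add(path)   # if this archive is later corrupted, only
                            # the files listed here are affected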

HTH
Linux-Fan

