
Re: defining deduplication (was: Re: deduplicating file systems: VDO with Debian?)



Hi,

hw wrote:
> I still wonder how VDO actually works.

There is a comparer/decider named UDS which holds an index of the valid
storage blocks, and a device driver named VDO which performs the
deduplication and hides its internals from the user by providing a
block device on top of the real storage device file.
  https://www.marksei.com/vdo-linux-deduplication/


> if I have a volume with 5TB of data on it and I write a 500kB file to that
> volume a week later or whenever, and the file I'm writing is identical to
> another file somewhere within the 5TB of data already on the volume, how
> does VDO figure out that both files are identical?

I understand that it chops your file into 4 KiB blocks
  https://github.com/dm-vdo/kvdo/issues/18
and lets UDS look up the checksum of each such block in the index. If a
match is found, then the new block is not stored as itself but only as
a reference to the found block.
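
A minimal sketch of that principle in Python, as i understand it (all
names and the choice of SHA-256 as checksum are my own stand-ins, not
taken from the real kvdo code):

  # Toy model of 4 KiB block deduplication.
  import hashlib

  BLOCK_SIZE = 4096                  # 4 KiB, as used by VDO

  index = {}                         # checksum -> block number (UDS role)
  storage = []                       # stored blocks (VDO role)

  def store_block(data):
      key = hashlib.sha256(data).digest()
      if key in index:
          return index[key]          # known content: reference only
      storage.append(data)           # new content: store the block
      index[key] = len(storage) - 1
      return index[key]

  def store_file(data):
      # Chop into 4 KiB pieces, pad the last one, map each to a block.
      chunks = [data[i:i + BLOCK_SIZE]
                for i in range(0, len(data), BLOCK_SIZE)]
      return [store_block(c.ljust(BLOCK_SIZE, b'\0')) for c in chunks]

Writing your identical 500 kB file a second time would then merely
produce a list of references to the blocks stored a week earlier.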


This might yield deduplication more often than if the file were looked at
as a whole. But i still have doubts that this would yield much advantage
with my own data.
The main obstacle for partial matches is probably the demand for 4 KiB
alignment. Neither text oriented files nor compressed files will necessarily
hold their identical file parts with that alignment. Any shift by other than
a multiple of 4 KiB would make the similarity invisible to UDS/VDO.
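
To illustrate with the toy model from above (my sketch again, with
random bytes standing in for real file content):

  # A shift by a single byte changes the content of every 4 KiB block,
  # so none of the checksums match the previously stored blocks.
  # (Continues the store_file() sketch from above.)
  import os

  payload = os.urandom(64 * 1024)    # 64 KiB of file content

  a = store_file(payload)            # 16 new blocks get stored
  b = store_file(payload)            # all 16 deduplicated against a
  c = store_file(b'X' + payload)     # shifted by 1 byte: 17 new blocks

  assert a == b                      # identical references
  assert not set(a) & set(c)         # (almost surely) no shared block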


didier gaumet wrote:
> The goal being primarily to optimize storage space
> for a provider of networked virtual machines to entities or customers

Deduplicating over several nearly identical filesystem images might indeed
bring a good size reduction.


hw wrote:
> When I want to have 2 (or more) generations of backups, do I actually want
> deduplication?

Deduplication reduces uncontrolled redundancy, whereas backups shall create
controlled redundancy. So the two are not exactly contrary in their goals,
but they surely need to be coordinated.

In the case of VDO i expect that you need to use separate deduplicating
devices to get controlled redundancy.
I do something similar with incremental backups at file granularity. My backup
Blu-rays hold 200+ sessions which mostly re-use the file data storage
of previous sessions. If a bad spot damages file content, then it is
damaged in all sessions which refer to it.
To reduce the probability of such a loss, i run several backups per day,
each on a separate BD disc.

From time to time i make verification runs on the backup discs in order
to check for any damage. It is extremely rare to find a bad spot after the
written session was verified directly after being written.
(The verification is based on MD5 checksums, which i deem sufficient,
because my use case avoids the birthday paradox of probability theory.
UDS/VDO looks like a giant birthday party. So i assume that it uses larger
checksums or verifies content identity when checksums match.)
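
For the curious, the usual birthday estimate goes like this (my own
back-of-the-envelope arithmetic; i do not know the real width of the
UDS checksum):

  # Probability of at least one collision among n random checksums of
  # b bits is roughly n^2 / 2^(b+1), as long as that value is small.
  def collision_probability(n_blocks, bits):
      return n_blocks ** 2 / 2 ** (bits + 1)

  n = 5 * 2 ** 40 // 4096                # 4 KiB blocks in 5 TiB
  print(collision_probability(n, 64))    # ~0.05  : quite risky
  print(collision_probability(n, 128))   # ~3e-21 : negligible

So a short checksum would make a random collision within a few TiB
quite plausible, whereas an MD5-sized one keeps it negligible as long
as nobody crafts colliding content on purpose.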


Have a nice day :)

Thomas

