
Re: defining deduplication (was: Re: deduplicating file systems: VDO with Debian?)



On Tue, 2022-11-08 at 11:11 +0100, Thomas Schmitt wrote:
> Hi,
> 
> hw wrote:
> > I still wonder how VDO actually works.
> 
> There is a comparer/decider named UDS which holds an index of the valid
> storage blocks, and a device driver named VDO which performs the
> deduplication and hides its internals from the user by providing a
> block device on top of the real storage device file.
>   https://www.marksei.com/vdo-linux-deduplication/
> 

And how come it doesn't require as much memory as ZFS seems to need for
deduplication?  Apparently, ZFS uses either 128kB or variable block sizes[1]
and could therefore use much less memory than VDO would have to, because VDO
uses much smaller blocks (see the rough calculation below).


[1]: https://en.wikipedia.org/wiki/ZFS#Variable_block_sizes
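
Just to make the puzzlement concrete, here is a back-of-envelope sketch in
Python.  The per-entry sizes are pure guesses on my part (neither project
publishes a single number like this), but whatever the real values are, the
32x difference in block count stands:

    # Hypothetical index sizes for 5 TiB of unique data. The bytes-per-entry
    # figures are assumptions for illustration, not official numbers.
    data = 5 * 2**40              # 5 TiB

    zfs_block = 128 * 2**10       # 128 KiB records
    vdo_block = 4 * 2**10         # 4 KiB blocks

    zfs_entry_bytes = 320         # assumed size of a ZFS dedup-table entry
    vdo_entry_bytes = 32          # assumed size of a UDS index entry

    zfs_index = (data // zfs_block) * zfs_entry_bytes
    vdo_index = (data // vdo_block) * vdo_entry_bytes

    print(zfs_index / 2**30, "GiB")   # ~12.5 GiB with the larger blocks
    print(vdo_index / 2**30, "GiB")   # ~40 GiB with the smaller blocks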

> > if I have a volume with 5TB of data on it and I write a 500kB file to that
> > volume a week later or whenever, and the file I'm writing is identical to
> another file somewhere within the 5TB of data already on the volume, how
> > does VDO figure out that both files are identical?
> 
> I understand that it chops your file into 4 KiB blocks
>   https://github.com/dm-vdo/kvdo/issues/18
> and lets UDS look up the checksum of each such block in the index. If a
> match is found, then the new block is not stored itself but only as a
> reference to the found block.
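
If that is accurate, the core of it is something like this little Python
sketch.  Only the 4 KiB block size comes from the description above; the hash
choice and the data structures are made up for illustration:

    import hashlib

    BLOCK_SIZE = 4096      # 4 KiB, as VDO reportedly uses
    index = {}             # checksum -> position of the stored block
    storage = []           # stand-in for the real block device

    def write_block(data):
        """Store a block, or only reference it if it is a duplicate."""
        key = hashlib.sha256(data).digest()
        if key in index:                 # index hit: content already stored
            return index[key]
        storage.append(data)             # new content: store it for real
        index[key] = len(storage) - 1
        return index[key]

    def write_file(path):
        refs = []
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK_SIZE):
                refs.append(write_block(chunk))
        return refs          # the file becomes a list of block references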

So the VDO people say 4kB is a good block size and larger blocks would suck for
performance.  Does ZFS suck for performance because it uses larger block sizes,
and why doesn't ZFS use the smaller block sizes when those are the most
advantageous ones?

> This might yield deduplication more often than if the file was looked at as a
> whole. But i still have doubts that this would yield much advantage with
> my own data.
> The main obstacle for partial matches is probably the demand for 4 KiB alignment.
> Neither text-oriented files nor compressed files will necessarily hold their
> identical file parts with that alignment. Any shift of not exactly 4 KiB
> would make the similarity invisible to UDS/VDO.
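
That shift problem is easy to demonstrate.  The hashing below is purely
illustrative (it is not how UDS computes its index), but it shows that a
single inserted byte makes every aligned 4 KiB block hash to something new:

    import hashlib, os

    def block_hashes(data, size=4096):
        return [hashlib.sha256(data[i:i + size]).digest()
                for i in range(0, len(data), size)]

    original = os.urandom(4096 * 100)    # 100 blocks of arbitrary data
    shifted = b"\x00" + original         # same content, shifted by one byte

    a = set(block_hashes(original))
    b = block_hashes(shifted)
    print(sum(h in a for h in b))        # prints 0: no block lines up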

Deduplication doesn't work when files aren't sufficiently identical, no matter
what block size is used for comparing.  It seems to make sense that the larger
the blocks are, the lower the chances are that two blocks are identical.

So how come deduplication with ZFS works at all?  The large block sizes
would prevent that.  Maybe it doesn't work well enough to be worth it?

Does ZFS compress blocks or files when compression is enabled?  Using variable
block sizes when compression is enabled might indicate that it compresses
blocks.

> didier gaumet wrote:
> > The goal being primarily to optimize storage space
> > for a provider of networked virtual machines to entities or customers
> 
> Deduplicating over several nearly identical filesystem images might indeed
> bring good size reduction.

Well, it's independent of the file system.  Whether it's VM images on whatever
file system or N copies of the same backup differing only by the time the
backup was made, I don't see why deduplication shouldn't work well for both.

> hw wrote:
> > When I want to have 2 (or more) generations of backups, do I actually want
> > deduplication?
> 
> Deduplication reduces uncontrolled redundancy, while backups shall create
> controlled redundancy. So the two are not exactly contrary in their goals,
> but they surely need to be coordinated.

That's a really nice way to put it.  Do I want/need controlled redundancy with
backups on the same machine, or is it better to use snapshots and/or
deduplication to reduce the controlled redundancy?

> In the case of VDO i expect that you need to use different deduplicating
> devices to get controlled redundancy.

How would the devices matter?  It's the volume residing on the devices that
gets deduplicated, not the devices themselves.

> I do something similar with incremental backups at file granularity. My backup
> Blu-rays hold 200+ sessions which mostly re-use the file data storage
> of previous sessions. If a bad spot damages file content, then it is
> damaged in all sessions which refer to it.
> To reduce the probability of such a loss, i run several backups per day,
> each on a separate BD disc.
> 
> From time to time i make verification runs on the backup discs in order
> to check for any damage. It is extremely rare to find a bad spot after the
> written session was verified directly after being written.
> (The verification is based on MD5 checksums, which i deem sufficient,
> because my use case avoids the birthday paradox of probability theory.
> UDS/VDO looks like a giant birthday party. So i assume that it uses larger
> checksums or verifies content identity when checksums match.)
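
For what it's worth, such a verification run boils down to something like the
sketch below.  The manifest file name and its "<md5>  <filename>" line format
are just assumptions for the example:

    import hashlib

    def md5sum(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # manifest.md5 is assumed to hold lines of the form "<md5>  <filename>"
    with open("manifest.md5") as manifest:
        for line in manifest:
            expected, name = line.split(maxsplit=1)
            name = name.strip()
            status = "OK" if md5sum(name) == expected else "DAMAGED"
            print(status, name)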

How can you make backups on Blu-rays?  They hold only 50GB or so and I'd need
thousands of them.  Do you have an automatic changer that juggles 10000
discs or so? :)

