
Re: deduplicating file systems: VDO with Debian?



Curt wrote: 
> On 2022-11-08, DdB <debianlist@potentially-spam.de-bruyn.de> wrote:
> >> 
> > Your wording likely confuses 2 different concepts:
> >
> > Deduplication avoids storing identical data more than once.
> > whereas
> > Redundancy stores information in more than one place on purpose to
> > avoid loss of data in case of havoc.
> 
> So they're antithetical concepts? Redundancy sounds a lot like a
> backup.


Think of it this way:

You have some data that you want to protect against the machine
dying.

So you copy it to another machine. Now you have a backup.

You need to do this repeatedly, or else your backup is stale:
lacking information that was recently changed.

If you copy everything to the same target repeatedly, that's a
lot of data to transfer each time. Maybe you could send only the
changes? rsync, ZFS send, and some other methods make that
pretty easy.
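
Just to make the "send only the changes" idea concrete, here is a
toy sketch in Python. It is nothing like rsync's actual algorithm
(rsync uses rolling checksums and can transfer partial files),
the paths are made up, and real tools handle subdirectories,
permissions, deletions and much more:

    import hashlib, os, shutil

    def file_hash(path):
        """Hash a file's contents so we can tell whether it changed."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def incremental_copy(src_dir, dst_dir):
        """Copy only the files whose contents differ from the target."""
        os.makedirs(dst_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            src = os.path.join(src_dir, name)
            dst = os.path.join(dst_dir, name)
            if not os.path.isfile(src):
                continue
            if os.path.exists(dst) and file_hash(src) == file_hash(dst):
                continue               # unchanged: nothing to send
            shutil.copy2(src, dst)     # new or changed: copy it

    # Hypothetical paths, purely for illustration.
    incremental_copy("/home/me/data", "/mnt/backup/data")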

But what if you accidentally deleted a file a week ago, and the
backups are done every night? You're out of luck... unless you
somehow kept a record of all the changes, or you have a second
backup that was taken before the deletion.

Snapshots (rsnapshot, ZFS snapshots, others...) make it easy to
go back in time to any snapshot and retrieve the state of the
data at that moment, without storing full copies of all the data
every time.
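
Here is a rough sketch (in Python, with made-up paths) of the
hard-link trick that rsnapshot uses: every snapshot looks like a
full copy, but files that did not change are hard links into the
previous snapshot, so they take almost no extra space. ZFS
snapshots get the same effect at the block level, inside the
filesystem:

    import filecmp, os, shutil

    def take_snapshot(src_dir, prev_snap, new_snap):
        """Build new_snap from src_dir, hard-linking any file that is
        unchanged since prev_snap instead of copying it again."""
        os.makedirs(new_snap, exist_ok=True)
        for name in os.listdir(src_dir):
            src = os.path.join(src_dir, name)
            old = os.path.join(prev_snap, name)
            new = os.path.join(new_snap, name)
            if not os.path.isfile(src):
                continue
            if os.path.exists(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: share the old copy
            else:
                shutil.copy2(src, new)  # changed or new: store it fully

    take_snapshot("/home/me/data",
                  "/mnt/backup/daily.1",   # yesterday's snapshot
                  "/mnt/backup/daily.0")   # today's snapshot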

Now, let's suppose that you want your live data -- the source --
to withstand a disk dying. If all the data is on one disk,
that's not going to happen. You can stripe the data across N
disks, but since there's still only one copy of any given chunk
of data, that doesn't help with resiliency -- if anything it
hurts, because losing any one of the N disks loses the data.

Instead, you can make multiple complete copies every time you do
a write: disk mirroring, or RAID 1. This is very fast, but eats
twice the disk space.

If you can accept slower performance, you can write the data in
chunks across N disks, and write a checksum (parity) calculated
from those chunks to one more disk, such that any 1 disk of the
N+1 can fail and you can still reconstruct all the data. That's
RAID 5 (in practice the parity is rotated across all the disks
rather than living on a dedicated one).

A slightly more complicated calculation uses two independent
checksums and withstands the loss of any 2 disks of the N+2:
RAID 6. ZFS even has a three-disk resiliency mode (raidz3).
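
To make the checksum idea concrete, here is a toy single-parity
example in Python. It shows the XOR trick that RAID 5 is built
on; RAID 6's second checksum needs fancier math, and real
implementations rotate the parity across the disks and work on
much larger stripes:

    def xor_blocks(blocks):
        """XOR equal-sized blocks together, byte by byte."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # Data striped across three "disks", parity stored on a fourth.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Disk 1 dies: rebuild its chunk from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]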

Depending on your risk tolerance and performance needs, you
might use RAID 10 (striping plus mirroring) on your main machine,
and back up to a more space-efficient but slower RAID 6 on your
backup target.

What we've left out is compression and deduplication.

On modern CPUs, compression is really fast. So fast that it
usually makes sense for the filesystem to try compressing all
the data it is about to write, and to store the compressed data
with a flag that says it will need to be decompressed when read.
This not only increases your available storage capacity, it can
make some reads and writes faster, because less has to be
transferred to and from the relatively slow disk. The effect is
bigger on rotating disks than on SSDs.
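
Roughly what a compressing filesystem does on every write,
sketched in Python. zlib stands in for the much faster
compressors (lz4, zstd) that real filesystems use, and a real
filesystem keeps the flag in block metadata rather than beside
the data:

    import zlib

    def write_block(data):
        """Compress a block if that actually makes it smaller;
        otherwise store it as-is. The flag tells reads what to do."""
        packed = zlib.compress(data)
        if len(packed) < len(data):
            return ("compressed", packed)
        return ("raw", data)

    def read_block(flag, payload):
        return zlib.decompress(payload) if flag == "compressed" else payload

    block = b"the same text, over and over. " * 100
    stored = write_block(block)
    assert read_block(*stored) == block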

Deduplication tries to notice when data about to be written is
identical to data already on disk, and to store a pointer to the
existing copy instead. This is an easy problem as long as you
have two things: a fast way to match the data perfectly, and a
very fast way to look up everything that has previously been
written.

It turns out that both of those subproblems scale badly. The
main use case is for storing multiple virtual machine instances,
or something similar, where you can expect every one of them to
have a large percentage of identical files stemming from the
operating system installation.
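
Here is a toy sketch in Python of how block-level dedup works:
hash every block, and if the hash has been seen before, store
only a reference to the existing block. The hash index is the
part that gets enormous in real life -- ZFS's dedup table is
notorious for needing a lot of RAM.

    import hashlib

    class DedupStore:
        """Store each unique block once; duplicates become references."""
        def __init__(self):
            self.blocks = {}   # hash -> block data (the "disk")
            self.layout = []   # the file, as a list of block hashes

        def write(self, block):
            key = hashlib.sha256(block).hexdigest()
            if key not in self.blocks:    # the lookup that must be fast
                self.blocks[key] = block  # first time seen: store data
            self.layout.append(key)       # always: store a pointer

        def read(self):
            return b"".join(self.blocks[k] for k in self.layout)

    store = DedupStore()
    for block in (b"os files", b"user data", b"os files"):
        store.write(block)
    assert len(store.blocks) == 2         # three writes, two stored blocks
    assert store.read() == b"os filesuser dataos files"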

-dsr-

