
Re: defining deduplication (was: Re: deduplicating file systems: VDO with Debian?)



Hi,

I wrote:
> >   https://github.com/dm-vdo/kvdo/issues/18

hw wrote:
> So the VDO ppl say 4kB is a good block size

They actually say that it's the only size which they support.


> Deduplication doesn't work when files aren't sufficiently identical,

The definition of "sufficiently identical" probably differs a lot
between VDO and ZFS.
ZFS has more knowledge about the files than VDO has. So it might be
worthwhile for it to hold more information in memory.


> It seems to make sense that the larger
> the blocks are, the lower chances are that two blocks are identical.

Especially if the filesystem's block size is smaller than the VDO
block size, or if the filesystem, like ReiserFS, does not align file
content to block boundaries.


> So how come that deduplication with ZFS works at all?

Inner magic and knowledge about how blocks of data form a file object.
A filesystem does not have to hope that identical file content is
aligned to a fixed block size.
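
To illustrate the alignment problem, here is a minimal Python sketch of
fixed-size block hashing. It is not how VDO or ZFS are implemented; the
4 KiB block size, the SHA-256 hash and the helper name block_hashes are
assumptions chosen only for the demonstration. It compares how many
blocks two copies of the same content share when both are aligned, and
when one copy is shifted by 100 bytes.

```python
import hashlib
import os

BLOCK = 4096  # assumed block size for illustration (4 KiB)

def block_hashes(data: bytes, block: int = BLOCK) -> set:
    """Hash every fixed-size block; the last block is zero-padded."""
    hashes = set()
    for i in range(0, len(data), block):
        chunk = data[i:i + block].ljust(block, b"\0")
        hashes.add(hashlib.sha256(chunk).digest())
    return hashes

payload = os.urandom(16 * BLOCK)     # 16 blocks of random "file content"
aligned_copy = payload               # same content, same alignment
shifted_copy = b"\0" * 100 + payload # same content, shifted by 100 bytes

shared_aligned = len(block_hashes(payload) & block_hashes(aligned_copy))
shared_shifted = len(block_hashes(payload) & block_hashes(shifted_copy))
print(shared_aligned, shared_shifted)
```

With random content the aligned copy shares all 16 blocks, while the
shifted copy shares none, although the payload is byte-for-byte the same.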


didier gaumet wrote:
> > > The goal being primarily to optimize storage space
> > > for a provider of networked virtual machines to entities or customers

I wrote:
> > Deduplicating over several nearly identical filesystem images might indeed
> > bring good size reduction.

hw wrote:
> Well, it's independant of the file system.

Not entirely. As stated above, I would expect VDO not to work well with
ReiserFS, given its habit of squeezing data into unused parts of storage
blocks. (This made it great for storing many small files, but also cost
some performance through increased fragmentation.)


> Do I want/need controlled redundancy with
> backups on the same machine, or is it better to use snapshots and/or
> deduplication to reduce the controlled redundancy?

I would want several independent backups in the first place.

The highest risk for a backup arises when the backup storage gets
overwritten or updated. So I want several backups to remain untouched
and valid when the storage hardware or the backup software begins to
spoil things.

Deduplication increases the risk that a partial failure of the backup
storage damages more than one backup. On the other hand it decreases the
work load on the storage and the time window in which the backed-up data
can become inconsistent on the application level.
Taking a snapshot before the backup reduces that window to zero. But
this still does not prevent application level inconsistencies if the
application is caught in the act of reworking its files.

So I would use at least four independent storage facilities in rotation.
I would make snapshots, if the filesystem supports them, and back up
those instead of the changeable filesystem.
I would try to reduce the activity of applications on the filesystem when
the snapshot is made.
I would allow each independent backup storage to do its own deduplication,
not sharing it with the other backup storages.


> > In case of VDO i expect that you need to use different deduplicating
> > devices to get controlled redundancy.

> How would the devices matter?  It's the volume residing on devices that gets
> deduplicated, not the devices.

I understand that one VDO device implements one deduplication domain.
So if the backups should not share deduplication, I expect that each
backup storage needs its own VDO device.
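
The difference between shared and separate deduplication can be sketched
with a toy content-addressed store in Python. The class DedupStore and
the backup names are hypothetical and have nothing to do with VDO's real
on-disk format. The sketch only shows which backups are hit when a
single stored block is lost.

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store (hypothetical, not VDO's design)."""

    def __init__(self):
        self.blocks = {}   # digest -> block data, stored only once
        self.backups = {}  # backup name -> list of block digests

    def add_backup(self, name, blocks):
        digests = []
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # deduplicate
            digests.append(digest)
        self.backups[name] = digests

    def damaged_backups(self, lost_digest):
        """Names of all backups that reference the lost block."""
        return sorted(n for n, ds in self.backups.items() if lost_digest in ds)

data = [b"A" * 4096, b"B" * 4096]

# One shared deduplicating store for both backups.
shared = DedupStore()
shared.add_backup("monday", data)
shared.add_backup("tuesday", data)

# Two independent stores, one backup each.
store1, store2 = DedupStore(), DedupStore()
store1.add_backup("monday", data)
store2.add_backup("tuesday", data)

lost = hashlib.sha256(data[0]).hexdigest()
print(shared.damaged_backups(lost))  # both backups are affected
print(store1.damaged_backups(lost))  # only the backup in this store
```

In the shared store the two backups occupy only two blocks in total, but
one lost block damages both of them. With separate stores the loss stays
confined to the store where it happened.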


> How can you make backups on Bluerays? They hold only 50GB or so and I'd
> need thousands of them.

My backup needs are much smaller than yours, obviously.
I have an active $HOME tree of about 4 GB and a large but less
frequently changing data hoard of about 500 GB.
The former gets backed up 5 times per day on appendable 25 GB BD media
(as stated, 200+ days fit on one BD).
The latter gets an incremental update on a single-session 25 GB BD every
other day. A new base backup needs about 20 BD media. Each time the
single update BD is full, it joins the base backup in its cake box and a
new incremental level gets started.
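
As a rough plausibility check of these numbers, here is a small Python
calculation. It assumes decimal gigabytes and ignores filesystem and
session overhead on the medium, so the results are only estimates.

```python
import math

GB = 10**9                 # decimal gigabyte, an assumption
bd_capacity = 25 * GB      # nominal single-layer BD capacity
hoard_size = 500 * GB      # size of the large data hoard
runs_per_day = 5           # $HOME backup runs per day
days_per_bd = 200          # days that fit on one BD

# Media needed for a full base backup of the hoard.
base_media = math.ceil(hoard_size / bd_capacity)

# Backup sessions of $HOME that end up on one BD.
sessions = runs_per_day * days_per_bd

# Average space available per session on that BD.
avg_session_bytes = bd_capacity // sessions

print(base_media, sessions, avg_session_bytes)
```

Under these assumptions a base backup takes 20 media, and each of the
1000 sessions on the $HOME medium may use 25 MB on average. This only
works because the sessions are incremental and do not repeat the whole
4 GB tree each time.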

If you have much more valuable data to back up then you will probably
opt for rotating magnetic storage. Not only for capacity but also for
the price/capacity ratio.
But you should consider keeping at least some of your backups on
removable media, e.g. hard disks in USB boxes. Only those can be isolated
from the risks of daily operation, which I deem crucial for safe backup.


Have a nice day :)

Thomas

