
Re: defining deduplication (was: Re: deduplicating file systems: VDO with Debian?)



On Wed, 2022-11-09 at 12:08 +0100, Thomas Schmitt wrote:
> Hi,
> 
> i wrote:
> > >   https://github.com/dm-vdo/kvdo/issues/18
> 
> hw wrote:
> > So the VDO ppl say 4kB is a good block size
> 
> They actually say that it's the only size which they support.
> 
> 
> > Deduplication doesn't work when files aren't sufficiently identical,
> 
> The definition of sufficiently identical probably differs much between
> VDO and ZFS.
> ZFS has more knowledge about the files than VDO has. So it might be worth
> for it to hold more info in memory.

Dunno, apparently they keep checksums of blocks in memory.  More checksums, more
memory ...
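
As a toy illustration of the cost (a Python sketch of the general idea, not of
VDO's actual index; the 4k block size and SHA-256 are just assumptions here):

    import hashlib

    BLOCK_SIZE = 4096  # fixed block size, as VDO uses (assumption in this sketch)

    def build_index(path, index=None):
        """Record the first location of every distinct block checksum.

        A block whose checksum is already in the index would be stored
        only once and referenced, which is the point of deduplication.
        """
        index = {} if index is None else index
        duplicates = 0
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).digest()
                if digest in index:
                    duplicates += 1        # already stored somewhere else
                else:
                    index[digest] = (path, offset)
                offset += len(block)
        return index, duplicates

One entry per unique block adds up: a 1 TiB volume in 4k blocks is about 268
million blocks, so even a 32-byte checksum plus a location per block is on the
order of 10 GB before any clever compaction.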

> > It seems to make sense that the larger
> > the blocks are, the lower chances are that two blocks are identical.
> 
> Especially if the filesystem's block size is smaller than the VDO
> block size, or if the filesystem does not align file content intervals
> to block size, like ReiserFS does.

That would depend on the files.

> > So how come that deduplication with ZFS works at all?
> 
> Inner magic and knowledge about how blocks of data form a file object.
> A filesystem does not have to hope that identical file content is
> aligned to a fixed block size.

No, but when it uses large blocks it can store several files in one block, and
then it can't deduplicate identical files that share a block, because blocks
are the atoms of deduplication.  The larger the blocks are, the less likely it
seems that two blocks are identical.
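
As a rough illustration (a throwaway Python snippet; the file name and block
sizes are made up), counting how many blocks of the same data repeat at
different granularities usually shows fewer repeats at the coarser size,
because a single differing byte spoils the whole block:

    import hashlib
    from collections import Counter

    def repeated_blocks(path, block_size):
        """Count blocks that are exact repeats of an earlier block."""
        seen = Counter()
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                seen[hashlib.sha256(block).digest()] += 1
        return sum(n - 1 for n in seen.values() if n > 1)

    # hypothetical VM image, just for illustration
    for size in (4096, 131072):
        print(size, repeated_blocks("vm-image.raw", size))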

> didier gaumet wrote:
> > > > The goal being primarily to optimize storage space
> > > > for a provider of networked virtual machines to entities or customers
> 
> I wrote:
> > > Deduplicating over several nearly identical filesystem images might indeed
> > > bring good size reduction.
> 
> hw wrote:
> > Well, it's independent of the file system.
> 
> Not entirely. As stated above, i would expect VDO to work not well for
> ReiserFS with its habit to squeeze data into unused parts of storage blocks.
> (This made it great for storing many small files, but also led to some
> performance loss by more fragmentation.)

VDO is independent of the file system, and 4k blocks are kinda small.  It
doesn't matter to VDO how files are aligned to the blocks of a file system,
because it always takes chunks of 4k each, compares them, and always works the
same way.  You can always create a file system with an unlucky block size for
the files on it, or even one that makes sure that no two of its 4k blocks are
ever identical.  We could call it spitefs maybe :)
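
A spitefs wouldn't even have to be clever: shifting otherwise identical data
by a few bytes is already enough to defeat fixed-size chunking.  A toy Python
sketch of that effect (purely illustrative, nothing VDO-specific about it):

    import os

    BLOCK = 4096

    def chunk_set(data, size=BLOCK):
        """Split a byte string into fixed-size chunks, the way a
        fixed-block deduplicator would see them."""
        return {data[i:i + size] for i in range(0, len(data), size)}

    data = os.urandom(1 << 20)      # 1 MiB of arbitrary data
    shifted = b"x" * 7 + data       # the same content, 7 bytes later

    common = chunk_set(data) & chunk_set(shifted)
    print(len(common))  # almost certainly 0: no 4k chunk lines up any more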

> > Do I want/need controlled redundancy with
> > backups on the same machine, or is it better to use snapshots and/or
> > deduplication to reduce the controlled redundancy?
> 
> I would want several independent backups on the first hand.

Independent?  Like the two full copies I'm making?

> The highest risk for backup is when a backup storage gets overwritten or
> updated. So i want several backups still untouched and valid, when the
> storage hardware or the backup software begin to spoil things.

That's what I thought, but I'm about to run out of disk space for multiple full
copies.

> Deduplication increases the risk that a partial failure of the backup
> storage damages more than one backup. On the other hand it decreases the
> work load on the storage

It may make all backups unusable when the single copy that deduplication has
left behind gets damaged.  But how likely is a partial failure of a storage
volume, and how relevant is it?  How often does a storage volume go bad in
only one place?  (The underlying media don't necessarily matter; when a disk
goes bad in a RAID, you replace it and keep going.)  And when the whole volume
has gone away, so have all the copies.

>  and the time window in which the backuped data
> can become inconsistent on the application level.

Huh?

> Snapshot before backup reduces that window size to 0. But this still
> does not prevent application level inconsistencies if the application is
> caught in the act of reworking its files.

You make the snapshot before you start making the backup, not while making it.

Or are you referring to the data being altered while a backup is in progress?

> So i would use at least four independent storage facilities interchangeably.
> I would make snapshots, if the filesystem supports them, and backup those
> instead of the changeable filesystem.
> I would try to reduce the activity of applications on the filesystem when
> the snapshot is made.

right
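
For the record, the kind of thing I have in mind (a hedged sketch, assuming
btrfs and rsync; the paths and the target host are made up): snapshot first,
back up the frozen snapshot, then drop it.

    import subprocess

    SRC = "/srv/data"                     # hypothetical btrfs subvolume to back up
    SNAP = "/srv/data/.backup-snap"       # hypothetical snapshot path
    TARGET = "backuphost:/backups/data/"  # hypothetical rsync destination

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. read-only snapshot: a frozen, consistent view of the subvolume
    run("btrfs", "subvolume", "snapshot", "-r", SRC, SNAP)
    try:
        # 2. back up the snapshot, not the live data
        run("rsync", "-aHAX", "--delete", SNAP + "/", TARGET)
    finally:
        # 3. drop the snapshot once the backup is done
        run("btrfs", "subvolume", "delete", SNAP)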

> I would allow each independent backup storage to do its own deduplication,
> not sharing it with the other backup storages.

If you have them on different machines or volumes, it would be difficult to do
it otherwise.

> > > In case of VDO i expect that you need to use different deduplicating
> > > devices to get controlled redundancy.
> 
> > How would the devices matter?  It's the volume residing on devices that gets
> > deduplicated, not the devices.
> 
> I understand that one VDO device implements one deduplication.
> So if no sharing of deduplication is desired between the backups, then i
> expect that each backup storage needs its own VDO device.

right

Would you even make so many backups on the same machine?

> > How can you make backups on Blu-rays? They hold only 50GB or so and I'd
> > need thousands of them.
> 
> My backup needs are much smaller than yours, obviously.
> I have an active $HOME tree of about 4 GB and some large but less agile
> data hoard of about 500 GB.
> The former gets backuped 5 times per day on appendable 25 GB BD media
> (as stated, 200+ days fit on one BD).

That makes it a lot easier.  Isn't 5 times a day a bit much?  And it's an odd
number.

> The latter gets an incremental update on a single-session 25 GB BD every
> other day. A new base backup needs about 20 BD media. Each time the
> single update BD is full, it joins the base backup in its cake box and a
> new incremental level gets started.
> 
> If you have much more valuable data to backup then you will probably
> decide for rotating magnetic storage. Not only for capacity but also for
> the price/capacity ratio.

Yes, I'm re-using the many small hard discs that have accumulated over the
years.  It's much easier and way more efficient to use a few large discs for
the active data than many small ones, and using the small ones for backups is
way better than just having them lying around unused.

I wish we could still (relatively) easily make backups on tapes.  Just change
the tape every day and you can have a reasonable number of full backups.  Of
course, spooling and seeking tapes kinda sucks, but how often do you need to do
that?

> But you should consider to have at least some of your backups on
> removable media, e.g. hard disks in USB boxes. Only those can be isolated
> from the risks of daily operation, which i deem crucial for safe backup.

The backup server is turned off unless I'm making backups and its PDU port is
switched off, so not much will happen to it.  I'm bad because I'm making them
only once in a while, and last time was very long ago ...

Once I've figured out what to do, I'll make a backup.  A full new backup takes
ages and I need to stop modifying stuff and not start all over again all the
time.  I think last time I created a btrfs RAID5, being unaware that that's a
bad idea ...

