
Re: deduplicating file systems: VDO with Debian?



On 11/7/22 23:13, hw wrote:
On Mon, 2022-11-07 at 21:46 -0800, David Christensen wrote:

Are you deduplicating?


Yes.


Apparently some people say bad things happen when ZFS
runs out of memory from deduplication.


Okay.


16 GiB seems to be enough for my SOHO server.
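
If you want to see how big the dedup table actually is, ZFS will print the DDT statistics for a pool (the pool name here is just an example):

zpool status -D pool

The summary line reports the number of DDT entries and their per-entry size on disk and in core -- the in-core figure is what the memory concern is about.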


I put rsync based backups on ZFS storage with compression and
de-duplication.  du(1) reports 33 GiB for the current backups (i.e.
uncompressed and/or duplicated size).  zfs-auto-snapshot takes snapshots
of the backup filesystems daily and monthly, and I take snapshots
manually every week.  I have 78 snapshots going back ~6 months.  du(1)
reports ~3.5 TiB for the snapshots.  'zfs list' reports 86.2 GiB of
actual disk usage for all 79 backups.  So, ZFS de-duplication and
compression leverage my backup storage by 41:1.
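
(Those figures come straight from ZFS; with placeholder names, commands such as

zfs list -o space -r pool/backup
zpool get dedupratio pool
zfs get compressratio pool/backup

report space usage including snapshots, the pool-wide dedup ratio, and the compression ratio.)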

I'm unclear as to how snapshots come in when it comes to making backups.


I run my backup script each night. It uses rsync to copy files and directories from various LAN machines into ZFS filesystems named after each host -- e.g. pool/backup/hostname (ZFS namespace) and /var/local/backup/hostname (Unix filesystem namespace). A cron(8) job runs zfs-auto-snapshot once each day and once each month to take a recursive snapshot of the pool/backup filesystems. Their contents are then available via the Unix namespace at /var/local/backup/hostname/.zfs/snapshot/snapshotname. If I want to restore a file from, say, two months ago, I use Unix filesystem tools to get it.
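
For one host, a stripped-down sketch looks something like the following; the host, pool, and path names are placeholders, not my actual script:

# Pull one host's root filesystem into its ZFS backup dataset.
rsync -aHAX --delete --numeric-ids \
    --exclude=/proc --exclude=/sys --exclude=/dev \
    root@hostname:/ /var/local/backup/hostname/

# Snapshots are taken by zfs-auto-snapshot from cron; flags here are illustrative.
zfs-auto-snapshot --label=daily --keep=31 -r pool/backup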



What if you have a bunch of snapshots and want to get a file from 6
generations of backups ago?


Use Unix filesystem tools to copy it out of the snapshot tree. For example, a file from two months ago:

cp /var/local/backup/hostname/.zfs/snapshot/zfs-auto-snap_m-2022-09-01-03h21/path/to/file ~/restored-file
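
If you don't know which snapshot has the version you want, list what is there first (paths and names follow the layout above):

ls /var/local/backup/hostname/.zfs/snapshot/
zfs list -t snapshot -r pool/backup/hostname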


I never figured out how to get something out of an old snapshot
and found it all confusing, so I don't even use them.


Snapshots are a killer feature. You want to figure them out. I found the Lucas books to be very helpful:

https://mwl.io/nonfiction/os#fmzfs

https://mwl.io/nonfiction/os#fmaz
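
The core of it is small. For example, taking a recursive snapshot by hand and then listing what exists -- the names here are placeholders:

zfs snapshot -r pool/backup@weekly-2022-11-12
zfs list -t snapshot -r pool/backup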


33 GB in backups is far from a terabyte.  I have a lot more than that.


I have 3.5 TiB of backups.


For compressed and/or encrypted archives, images, etc., I do not use
compression or de-duplication.

Yeah, they wouldn't compress.  Why no deduplication?


Because I very much doubt that there will be duplicate blocks in such files.


The key is to only use de-duplication when there is a lot of duplication.

How do you know if there's much to deduplicate before deduplicating?


Think about the files and how often they change ("churn"). If I'm rsync'ing the root filesystem of a half dozen FreeBSD and Linux machines to a backup directory once a day, most of the churn will be in /home, /tmp, and /var. When I update the OS and/or packages, install software, etc., there will be more churn that day.


If you want hard numbers, fdupes(1), jdupes(1), or other tools should be able to tell you.
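
For example, with a placeholder path:

fdupes -r -m /var/local/backup

summarizes how much space duplicate files occupy. On ZFS, 'zdb -S pool' goes further and simulates block-level deduplication on an existing pool, printing the dedup ratio you would get without actually enabling it.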



My ZFS pools are built with HDDs.  I recently added an SSD-based vdev
as a dedicated 'dedup' device, and write performance improved
significantly when receiving replication streams.
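
For reference, a dedicated dedup vdev is attached with 'zpool add'; the pool and device names below are placeholders, and the vdev should be redundant because losing it faults the whole pool:

zpool add pool dedup mirror /dev/nvme0n1 /dev/nvme1n1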

Hm, with the ZFS I set up a couple years ago, the SSDs wore out and removing
them without any replacement didn't decrease performance.


My LAN has Gigabit Ethernet. I have operated with a degraded ZFS pool in my SOHO server and did not notice a performance drop on my client. If I had run benchmarks on the server before and after losing a redundant device, I expect the drop would have been obvious. But losing a redundant device means an increased risk of losing all of the data in the pool.
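
Checking pool health is cheap, so it is worth doing regularly (the pool name is an example):

zpool status -x
zpool status pool

The first prints a one-line "all pools are healthy" unless something is wrong; the second shows every device and its state.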


I'm not too fond of ZFS, especially not when considering performance.  But for
backups, it won't matter.


Learn more about ZFS and invest in hardware to get performance.


David

