
Re: deduplicating file systems: VDO with Debian?



On 11/7/22 23:13, hw wrote:
On Mon, 2022-11-07 at 21:46 -0800, David Christensen wrote:

Are you deduplicating?


Yes.


Apparently some people say bad things happen when ZFS
runs out of memory from deduplication.


Okay.


16 GiB seems to be enough for my SOHO server.
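
If you want to see how big the dedup table actually is, ZFS will print the DDT statistics for a pool (the pool name here is just an example):

zpool status -D pool

The summary line reports the number of DDT entries and their per-entry size on disk and in core -- the in-core figure is what the memory concern is about.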


I put rsync based backups on ZFS storage with compression and
de-duplication.  du(1) reports 33 GiB for the current backups (i.e.
uncompressed and/or duplicated size).  zfs-auto-snapshot takes snapshots
of the backup filesystems daily and monthly, and I take snapshots
manually every week.  I have 78 snapshots going back ~6 months.  du(1)
reports ~3.5 TiB for the snapshots.  'zfs list' reports 86.2 GiB of
actual disk usage for all 79 backups.  So, ZFS de-duplication and
compression leverage my backup storage by 41:1.
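
(Those figures come straight from ZFS; with placeholder names, commands such as

zfs list -o space -r pool/backup
zpool get dedupratio pool
zfs get compressratio pool/backup

report space usage including snapshots, the pool-wide dedup ratio, and the compression ratio.)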

I'm unclear as to how snapshots come in when it comes to making backups.


I run my backup script each night. It uses rsync to copy files and directories from various LAN machines into ZFS filesystems named after each host -- e.g. pool/backup/hostname (ZFS namespace) and /var/local/backup/hostname (Unix filesystem namespace). A cron(8) job runs zfs-auto-snapshot once each day and once each month to take a recursive snapshot of the pool/backup filesystems. Their contents are then available via the Unix namespace at /var/local/backup/hostname/.zfs/snapshot/snapshotname. If I want to restore a file from, say, two months ago, I use Unix filesystem tools to get it.
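
For one host, a stripped-down sketch looks something like the following; the host, pool, and path names are placeholders, not my actual script:

# Pull one host's root filesystem into its ZFS backup dataset.
rsync -aHAX --delete --numeric-ids \
    --exclude=/proc --exclude=/sys --exclude=/dev \
    root@hostname:/ /var/local/backup/hostname/

# Snapshots are taken by zfs-auto-snapshot from cron; flags here are illustrative.
zfs-auto-snapshot --label=daily --keep=31 -r pool/backup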



What if you have a bunch of snapshots and want to get a file from 6
generations of backups ago?


Use Unix filesystem tools to copy it out of the snapshot tree. For example, a file from two months ago:

cp /var/local/backup/hostname/.zfs/snapshot/zfs-auto-snap_m-2022-09-01-03h21/path/to/file ~/restored-file
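
If you don't know which snapshot has the version you want, list what is there first (paths and names follow the layout above):

ls /var/local/backup/hostname/.zfs/snapshot/
zfs list -t snapshot -r pool/backup/hostname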


I never figured out how to get something out of an old snapshot
and found it all confusing, so I don't even use them.


Snapshots are a killer feature. You want to figure them out. I found the Lucas books to be very helpful:

https://mwl.io/nonfiction/os#fmzfs

https://mwl.io/nonfiction/os#fmaz
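
The core of it is small. For example, taking a recursive snapshot by hand and then listing what exists -- the names here are placeholders:

zfs snapshot -r pool/backup@weekly-2022-11-12
zfs list -t snapshot -r pool/backup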


33 GB in backups is far from a terabyte.  I have a lot more than that.


I have 3.5 TiB of backups.


For compressed and/or encrypted archives, images, etc., I do not use
compression or de-duplication.

Yeah, they wouldn't compress.  Why no deduplication?


Because I very much doubt that there will be duplicate blocks in such files.


The key is to only use de-duplication when there is a lot of duplication.

How do you know if there's much to deduplicate before deduplicating?


Think about the files and how often they change ("churn"). If I'm rsync'ing the root filesystem of a half dozen FreeBSD and Linux machines to a backup directory once a day, most of the churn will be in /home, /tmp, and /var. When I update the OS and/or packages, install software, etc., there will be more churn that day.


If you want hard numbers, fdupes(1), jdupes(1), or other tools should be able to tell you.
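
For example, with a placeholder path:

fdupes -r -m /var/local/backup

summarizes how much space duplicate files occupy. On ZFS, 'zdb -S pool' goes further and simulates block-level deduplication on an existing pool, printing the dedup ratio you would get without actually enabling it.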



My ZFS pools are built with HDDs.  I recently added an SSD-based vdev
as a dedicated 'dedup' device, and write performance improved
significantly when receiving replication streams.
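
For reference, a dedicated dedup vdev is attached with 'zpool add'; the pool and device names below are placeholders, and the vdev should be redundant because losing it faults the whole pool:

zpool add pool dedup mirror /dev/nvme0n1 /dev/nvme1n1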

Hm, with the ZFS I set up a couple years ago, the SSDs wore out and removing
them without any replacement didn't decrease performance.


My LAN has Gigabit Ethernet. I have operated with a degraded ZFS pool in my SOHO server and did not notice a performance drop on my client. If I had run benchmarks on the server before and after losing a redundant device, I expect the drop would have been obvious. But losing a redundant device means an increased risk of losing all of the data in the pool.
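
Checking pool health is cheap, so it is worth doing regularly (the pool name is an example):

zpool status -x
zpool status pool

The first prints a one-line "all pools are healthy" unless something is wrong; the second shows every device and its state.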


I'm not too fond of ZFS, especially not when considering performance.  But for
backups, it won't matter.


Learn more about ZFS and invest in hardware to get performance.


David

