
Re: deduplicating file systems: VDO with Debian?



On Tue, 2022-11-08 at 17:30 -0800, David Christensen wrote:
> On 11/7/22 23:13, hw wrote:
> > On Mon, 2022-11-07 at 21:46 -0800, David Christensen wrote:
> 
> > Are you deduplicating?  
> 
> 
> Yes.
> 
> 
> > Apparently some people say bad things happen when ZFS
> > runs out of memory from deduplication.
> 
> 
> Okay.
> 
> 
> 16 GiB seems to be enough for my SOHO server.

Hmm, if you can back up something like 3.5TB with that, maybe I should put FreeBSD on my
server and give ZFS a try.  The worst that can happen is that it crashes and
I'd have made an experiment that wasn't successful.  The best case, I guess, is
that it works and backups are much faster, because the server doesn't have to
actually write as much data (it gets deduplicated) and reading from the
clients is faster than writing to the server.

> > > I put rsync based backups on ZFS storage with compression and
> > > de-duplication.  du(1) reports 33 GiB for the current backups (e.g.
> > > uncompressed and/or duplicated size).  zfs-auto-snapshot takes snapshots
> > > of the backup filesystems daily and monthly, and I take snapshots
> > > manually every week.  I have 78 snapshots going back ~6 months.  du(1)
> > > reports ~3.5 TiB for the snapshots.  'zfs list' reports 86.2 GiB of
> > > actual disk usage for all 79 backups.  So, ZFS de-duplication and
> > > compression leverage my backup storage by 41:1.
> > 
> > I'm unclear as to how snapshots come in when it comes to making backups. 
> 
> 
> I run my backup script each night.  It uses rsync to copy files and 

Aww, I can't really do that because my server eats like 200-300W with all the
disks in it.  Electricity is outrageously expensive here.

> directories from various LAN machines into ZFS filesystems named after 
> each host -- e.g. pool/backup/hostname (ZFS namespace) and 
> /var/local/backup/hostname (Unix filesystem namespace).  I have a 
> cron(8) that runs zfs-auto-snapshot once each day and once each month 
> that takes a recursive snapshot of the pool/backup filesystems.  Their 
> contents are then available via Unix namespace at 
> /var/local/backup/hostname/.zfs/snapshot/snapshotname.  If I want to 
> restore a file from, say, two months ago, I use Unix filesystem tools to 
> get it.

Sounds like a nice setup.  Does that mean you use snapshots to keep multiple
generations of backups, and make each new backup by overwriting everything
after taking a snapshot?
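If I understand it right, the cycle would look roughly like this (pool, host,
and path names are made up, not from your mail):

```shell
# Nightly backup sketch: one ZFS filesystem per client host.
# rsync a client into its backup filesystem (hypothetical names):
rsync -aHAX --delete root@hostname:/ /var/local/backup/hostname/

# Then freeze that state with a recursive snapshot
# (zfs-auto-snapshot does this on a schedule):
zfs snapshot -r pool/backup@$(date +%F)

# Old generations stay reachable read-only under the .zfs directory:
ls /var/local/backup/hostname/.zfs/snapshot/
```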

In that case, is deduplication that important/worthwhile?  You're not
duplicating everything by writing another generation of the backup; you're only
storing what's different, by making use of the snapshots.

> > What
> > if you have a bunch of snapshots and want to get a file from 6 generations
> > of
> > backups ago?  
> 
> 
> Use Unix filesystem tools to copy it out of the snapshot tree.  For 
> example, a file from two months ago:
> 
>      cp 
> /var/local/backup/hostname/.zfs/snapshot/zfs-auto-snap_m-2022-09-01-
> 03h21/path/to/file 
> ~/restored-file
> 

cool

> > I never figured out how to get something out of an old snapshot
> > and found it all confusing, so I don't even use them.
> 
> 
> Snapshots are a killer feature.  You want to figure them out.  I found 
> the Lucas books to be very helpful:
> 
> https://mwl.io/nonfiction/os#fmzfs
> 
> https://mwl.io/nonfiction/os#fmaz

I know, I just never got around to figuring them out because I didn't have the
need.  But they could also be useful for "little" things like taking a snapshot
of the root volume before updating or changing some configuration, and being
able to easily undo that.
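Something like this, I suppose (the dataset name is made up):

```shell
# Snapshot the root dataset before an update (hypothetical dataset name):
zfs snapshot rpool/ROOT/default@pre-update

# ... run the update or change the configuration ...

# If it goes wrong, roll the whole dataset back
# (rollback targets the most recent snapshot, or use -r to discard newer ones):
zfs rollback rpool/ROOT/default@pre-update

# Or just pull single files back out of the snapshot:
cp /.zfs/snapshot/pre-update/etc/some.conf /etc/some.conf
```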

> > 33GB in backups is far from a terabyte.  I have a lot more than that.
> 
> 
> I have 3.5 TiB of backups.
> 
> 
> > > For compressed and/or encrypted archives, image, etc., I do not use
> > > compression or de-duplication
> > 
> > Yeah, they wouldn't compress.  Why no deduplication?
> 
> 
> Because I very much doubt that there will be duplicate blocks in such files.

Hm, would it hurt?

> > > The key is to only use de-duplication when there is a lot of duplication.
> > 
> > How do you know if there's much to deduplicate before deduplicating?
> 
> 
> Think about the files and how often they change ("churn").  If I'm 
> rsync'ing the root filesystem of a half dozen FreeBSD and Linux machines 
> to a backup directory once a day, most of the churn will be in /home, 
> /tmp, and /var.  When I update the OS and/or packages, install software, 
> etc., there will be more churn that day.
> 
> 
> If you want hard numbers, fdupes(1), jdupes(1), or other tools should 
> be able to tell you.

ok, ty
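For my own notes, I guess that would be something like (paths made up):

```shell
# Estimate duplication among existing files before enabling dedup;
# -r recurses, -m prints a summary of duplicate sets and wasted space:
fdupes -r -m /var/local/backup
# jdupes takes the same flags and is faster on large trees:
jdupes -r -m /var/local/backup

# Once dedup is enabled, ZFS reports the achieved ratio per pool:
zpool list -o name,size,alloc,dedupratio
```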

> > > My ZFS pools are built with HDD's.  I recently added an SSD-based vdev
> > > as a dedicated 'dedup' device, and write performance improved
> > > significantly when receiving replication streams.
> > 
> > Hm, with the ZFS I set up a couple years ago, the SSDs wore out and removing
> > them without any replacement didn't decrease performance.
> 
> 
> My LAN has Gigabit Ethernet.  I have operated with a degraded ZFS pool 
> in my SOHO server, and did not notice a performance drop on my client. 
> If I had run benchmarks on the server before and after losing a 
> redundant device, I expect the performance drop would be obvious.  But, 
> losing redundant device means increased risk of losing all of the data 
> in the pool.

Oh, it's not about performance when degraded, but about performance in general.
IIRC when you have a ZFS pool that uses the equivalent of RAID5, you're still
limited to the speed of a single disk.  When you have a MySQL database on such
a ZFS volume, it's dead slow, and removing the SSD cache when the SSDs failed
didn't make it any slower.  Obviously, it was a bad idea to put the database
there, and I won't do it again when I can avoid it.  I also had my data on such
a volume, and I found that the performance with 6 disks left much to be desired.

> 
> > I'm not too fond of ZFS, especially not when considering performance.  But
> > for
> > backups, it won't matter.
> 
> 
> Learn more about ZFS and invest in hardware to get performance.

Hardware like?  In theory, using SSDs as cache with ZFS should improve
performance.  In practice, it only wore out the SSDs after a while, and now
it's not any faster without the SSD cache.
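For reference, attaching and detaching an L2ARC read cache is just this
(device and pool names are made up):

```shell
# Add an SSD as L2ARC read cache to a pool (hypothetical device name):
zpool add tank cache /dev/ada2

# Remove it again; cache devices can be removed at any time:
zpool remove tank /dev/ada2

# Per-vdev I/O statistics show whether the cache actually gets used:
zpool iostat -v tank 5
```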

