
Re: deduplicating file systems: VDO with Debian?



On Tue, 2022-11-08 at 07:19 -0500, Dan Ritter wrote:
> hw wrote: 
> > > As you say, deduplication in backup systems is quite common, and works
> > > pretty well. There's also an on-disk non-filesystem utility, rdfind,
> > > which is packaged in Debian. It can discover identical files and make
> > > them hardlinks.
> > 
> > Well, if I had all the disk space to hold 2 full copies of the data to be
> > able to deduplicate it only later, I wouldn't need to deduplicate anything.
> 
> Only two copies? That's not a good use case for any of the
> deduplicators.

Why not?

> The point of rdfind is to use it in a cron job while some process is
> generating duplicate files. For example, a backup process that copies a
> filesystem every six hours will generate four identical copies of almost
> every file each day. (rsnapshot would do a better job, here.)

That only works when you can make the backups fast enough and have sufficient
disk space to create so many copies.

> > And how would pretending there are two backups while there's actually only
> > one because it got deduplicated be better than having only one backup to
> > begin with?
> > (Yeah I haven't thought of that before ...)
> 
> It's not two backups, it's two very similar backups taken at
> different times, so the majority of the files are the same but
> some are different.

right

> If you want a second backup, it needs to go
> on a different machine, preferably in a different location.

That would certainly be an advantage, and I wouldn't want to deduplicate the
copies.

> Maybe you should tell us what your actual use case is rather
> than asking about realtime deduplication? It could be that
> there's a completely different solution which would make you
> happy.

The use case comes down to making backups once in a while.  When making another
backup, at least the most recent previous backup must not be overwritten.  Sooner
or later, there won't be enough disk space to keep two full backups.  With disk
prices as crazy high as they currently are, I might even move disks from the
backup server to the active server when it runs out of space, before I move data
into an archive (without backup) or start deleting stuff.  All prices keep going
up, so I don't expect disk prices to go down.

Deduplication is only one possible way to go about it.  I'm undecided whether it
would be better to have only one full backup and to use snapshots instead.
Deduplicating the backups would kinda turn two copies into only one for whatever
gets deduplicated, so that might not be better than snapshots.  Or I could use
both and perhaps save even more space.
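
To make the hardlink idea from earlier in the thread concrete, here is a
minimal Python sketch of after-the-fact deduplication: it finds identical
regular files under one directory and replaces the duplicates with hardlinks.
It only illustrates the general technique and is not how rdfind itself is
implemented. The function names (file_digest, hardlink_duplicates) and the
".dedup-tmp" suffix are made up for this example, and it assumes all files
live on the same filesystem, since hardlinks cannot cross filesystems.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace identical regular files under root with hardlinks.

    Returns the number of bytes that no longer need separate storage.
    Hypothetical helper for illustration only.
    """
    # Group candidates by size first, so only same-sized files get hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        first_seen = {}  # digest -> first path seen with that content
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # the keeper itself, or already hardlinked to it
            # Link under a temporary name, then rename over the duplicate,
            # so the path never disappears if something fails midway.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            saved += size
    return saved
```

Note that this still needs the space for both copies up front, which is the
objection raised above, and that hardlinked files share content and metadata:
changing one in place changes every name that points to it.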

