
Re: deduplicating file systems: VDO with Debian?



On Mon, 2022-11-07 at 16:29 +0100, hede wrote:
> Am 07.11.2022 02:57, schrieb hw:
> > Hi,
> > 
> > Is there no VDO in Debian, and what would be good to use for
> > deduplication with Debian?  Why isn't VDO in the standard kernel?
> > Or is it?
> 
> I used VDO in Debian some time ago and don't remember any big
> problems. AFAIR I compiled it myself; there were no prebuilt packages.

Cool, I could give that a try, ty.

> I switched to btrfs for other reasons, not even for performance. The VDO
> layer eats performance, yes, but compared to plain ext4 even btrfs is
> slow.

Really?  I never noticed that btrfs would be slow.  But then, it's been a long
time since I used ext4 ...

> > There is no point in deduplicating backups after they're done because
> > I don't need to save disk space for them when I can fit them in the
> > first place.
> 
> That's only one point.

What are the others?

> And it's not really a valid one, I think, as you typically do not run
> into space problems with one single action (YMMV). Running multiple
> sessions with out-of-band deduplication between them works for me.

That still requires you to have enough disk space for at least two full backups.
I can see it working for three backups because you can deduplicate the first
two, but not for two.  And why would I deduplicate when I have sufficient disk
space?
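
To put rough numbers on that, here is a minimal sketch of the peak disk usage
with out-of-band deduplication between sessions.  The function name and the
model are my own assumptions: every backup has the same full size, a fixed
fraction of it changes between backups, and deduplication removes everything
that did not change.

```python
def peak_space(full_size, change_fraction, n_backups):
    """Peak disk usage while writing backup number n_backups, assuming
    all earlier backups have already been deduplicated out-of-band."""
    if n_backups < 1:
        raise ValueError("need at least one backup")
    if n_backups == 1:
        return full_size
    # Earlier backups after deduplication: one full copy plus the
    # changed part of each backup that followed it.
    stored = full_size + (n_backups - 2) * change_fraction * full_size
    # The new backup has to land in full before it can be deduplicated.
    return stored + full_size


# Hypothetical figures: 1000 GB per backup, 25% changed between backups.
for n in (1, 2, 3):
    print(n, peak_space(1000, 0.25, n))
```

Under this model the second backup always needs room for two full copies; the
savings only show up from the third backup on.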

> In-band deduplication (that's the one you want) has some drawbacks, too:
> high resource usage. You need plenty of RAM (up to several gigabytes
> per terabyte of storage) and write completion is delayed (-> slow direct I/O).
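
A back-of-the-envelope sketch of where a figure like "several gigabytes per
terabyte" can come from: an in-memory index with one entry per stored block.
The 4 KiB block size and 16 bytes per index entry are assumptions made up for
illustration, not the figures of VDO or any other implementation, which may
need considerably less or more.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def dedup_index_bytes(storage_bytes, block_size=4096, entry_bytes=16):
    """Estimate the RAM for an index holding one entry per block.

    block_size and entry_bytes are illustrative assumptions only.
    """
    blocks = storage_bytes // block_size
    return blocks * entry_bytes


# 1 TiB of storage under the assumed block and entry sizes.
print(dedup_index_bytes(TIB) / GIB)  # prints 4.0
```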

Well, if it takes 5 days or so to make a backup, that won't be very useful.  It
takes more than long enough already because my disks can only sustain so much
throughput.

> For out-of-band deduplication there are multiple different
> implementations. File-based dedup on a per-directory basis can be very
> fast and resource-efficient, for example via rdfind or jdupes. Block-based
> dedup, for example via bees for btrfs (that's the one I use), is closer to
> in-band deduplication (including high RAM usage). Bees can be switched off
> and on at any time (for example on a small home system which runs more
> demanding tasks from time to time), and switching it on again resumes at
> the last state (it starts at the last transaction ID which was processed
> -> btrfs knows its transactions).
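
For illustration, the general idea behind file-based duplicate detection can
be sketched as below.  This is not how rdfind or jdupes are implemented; the
function names are hypothetical.  Files are grouped by size first, so that
only files sharing a size with another file have to be read and hashed.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

This only reports duplicates; replacing them with hard links or reflinks is
the step that actually frees space, and it is left out here.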

Hm.  I wouldn't mind running it from time to time, though I'm not sure I would
have much duplicate data other than backups.  How much space might I expect to
gain from using bees, and how much memory does it require to run?

