Re: [Debconf-discuss] btrfs deduplication / bedup users at DebConf13?
On Thu, Aug 15, 2013 at 11:14 PM, Adam Borowski <firstname.lastname@example.org> wrote:
> So you're using a filesystem with the capability you want, without actually
> using it. The "clone" ioctl does a copy-on-write link, without changes of
> semantics involved with hard links.
Sorry, I didn't make it clear that btrfs isn't an option for me: I'm
stuck with ext3 as I'm working with a 2.6.32 kernel. Still, if there's
some real benefit, ext4 or XFS could be considered.
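For reference, the "clone" ioctl Adam mentions (BTRFS_IOC_CLONE, later generalized as FICLONE) can be driven from Python. A minimal sketch; the FICLONE constant value and the fallback behaviour on non-reflink filesystems (e.g. ext3/ext4) are my assumptions, so it just reports failure there rather than cloning:

```python
import fcntl
import os

# FICLONE == _IOW(0x94, 9, int); numerically the same as BTRFS_IOC_CLONE.
FICLONE = 0x40049409

def clone_file(src_path, dst_path):
    """Try to make dst_path a copy-on-write clone of src_path.

    Returns True on success, False when the filesystem has no reflink
    support (ext3/ext4, tmpfs, or a cross-device clone attempt).
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # Typically EOPNOTSUPP, EINVAL or EXDEV: no reflink here.
            return False

src = "/tmp/ficlone-demo-src"
dst = "/tmp/ficlone-demo-dst"
with open(src, "wb") as f:
    f.write(b"hello reflink")
print("cloned:", clone_file(src, dst))
```

On btrfs (or XFS with reflink=1) the clone shares extents with the source, without the shared-inode semantics of a hard link; on ext3 it simply returns False.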
On Fri, Aug 16, 2013 at 3:26 AM, Rogério Brito <email@example.com> wrote:
> I'm not Steve, but this is *much* easier than the deduplication of blocks... :)
Yes. I think s.d.n has a *very* different usage pattern for dedup than
a normal FS: it's mostly aimed at a huge fileset with a very high dedup
ratio. And the best part is that writes are *completely* under control.
> If you only need to use this coarse deduplication, then take a look at
> rdfind, instead of hardlink. Hardlink compares the files that are
> likely to be the same (e.g., same size) byte by byte, while rdfind
> uses hashes (md5 or sha1, at your option) to compare the files.
Yes, but as I cannot afford hash collisions, falling back to a
byte-by-byte comparison is indeed needed in my case.
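The combined approach (group by size, narrow with a hash, then confirm byte-by-byte to rule out collisions) is cheap to sketch; the function names below are mine, not rdfind's or hardlink's:

```python
import hashlib
import os
from collections import defaultdict
from itertools import combinations

def file_hash(path, algo="md5", bufsize=1 << 16):
    """Hash a file in chunks, as rdfind-style candidate narrowing."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def identical(a, b, bufsize=1 << 16):
    """Byte-by-byte comparison: this is what rules out hash collisions."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(bufsize), fb.read(bufsize)
            if ca != cb:
                return False
            if not ca:  # both streams exhausted together
                return True

def duplicate_pairs(paths):
    """Yield (a, b) pairs of confirmed-identical files.

    Stage 1: group by size (stat only).  Stage 2: hash the survivors.
    Stage 3: byte-compare files whose hashes collide.
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_hash(p)].append(p)
        for group in by_hash.values():
            for a, b in combinations(group, 2):
                if identical(a, b):
                    yield a, b
```

Only files that survive the size grouping are ever read, and only hash-equal files pay for the full byte comparison.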
> On Thu, Aug 15, 2013 at 7:35 PM, Stefano Zacchiroli <firstname.lastname@example.org> wrote:
>> With how many files have you tried it (max)?
> I have tried it with probably much fewer files than you have (about
> 10^6 files only), but the vast majority of the time is spent with a
> cold cache with (spinning) hard disks, for both approaches.
Same here. I haven't tried it with the full install yet, only with 5
nfsroots, which is about 600k files in 80k dirs. Even so, the hardlink
pass seems quite time-consuming. I thought about running it only on a
subset of changed files.
On Fri, Aug 16, 2013 at 10:33 AM, Stefano Zacchiroli <email@example.com> wrote:
> none of the tools I've looked at seem to do that. I'll probably look
> into patching the one I'll end up choosing for that, but if you know of
> a similar tool that can use an external hash db, just shout!
Actually, I was also thinking about adding an option to "hardlink" for
an external db, one that could be reused across successive runs. That
would nicely allow feeding hardlink only a "find -ctime" list of
dirs and files to choose from.
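A minimal version of that external hash db (my own sketch, not an existing hardlink option) could key entries on path plus size and mtime in a dbm file, so unchanged files are never re-read on the next run:

```python
import dbm
import hashlib
import os

def cached_hash(db, path):
    """Return the file's sha1, consulting/updating a persistent dbm cache.

    The key embeds size and mtime, so a file touched or rewritten since
    the previous run is automatically re-hashed rather than trusted.
    """
    st = os.stat(path)
    key = ("%s|%d|%d" % (path, st.st_size, st.st_mtime_ns)).encode()
    if key in db:
        return db[key].decode()
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    db[key] = h.hexdigest()
    return h.hexdigest()

# Intended usage: feed it only the candidates from e.g. `find -ctime`:
#
#   with dbm.open("/var/cache/hardlink-hashes", "c") as db:
#       for path in recently_changed_files:
#           print(cached_hash(db, path))
```

Hash-equal files would of course still need the byte-by-byte confirmation pass before linking.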
Also, since all the writes would go through nfsd, I'm thinking about
doing the interception with LD_PRELOAD instead of a full-featured FUSE
filesystem.
Finally, I'm also currently looking at the internals of the backuppc
package. It implements full compression+dedup entirely in userspace,
since all write accesses are tightly controlled (as they might be on
s.d.n).