
Re: Bug#656142: ITP: duff -- Duplicate file finder



Samuel Thibault, 2012-01-17 12:03:41 +0100:

[...]

> I'm not sure I understand what you mean exactly. If you have even
> just a hundred files of the same size, you will need ten thousand file
> comparisons!

  I'm sure that can be optimised.  Read all 100 files in parallel,
comparing blocks at the same offset.  As long as all the blocks are
identical, each offset costs 99 comparisons; as soon as one of the 99
doesn't match, you can split your set of files at that offset into at
least two equivalence classes, which you then treat as separate subsets.
A subset with only one file drops out of the rest of the scan entirely,
and even if every subset still contains multiple files, each subsequent
step needs at least one comparison fewer.  In the worst case each file
is still read only once, so the total I/O is bounded by the combined
size of the files rather than by the number of pairs.
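
  To make the idea concrete, here is a minimal Python sketch of such a
scan.  It is only an illustration: the function name and block size are
my own choices, and grouping blocks through a dict stands in for the 99
explicit comparisons (it compares each pair of distinct blocks at most
once per offset).  Nothing here is taken from duff's actual code.

BLOCK_SIZE = 64 * 1024  # hypothetical block size; duff may use another


def duplicate_classes(paths):
    """Partition same-sized files into classes of identical content.

    All files are read in parallel, one block per file per offset; a
    class is split as soon as its members' blocks diverge, so a file
    is never read past its first difference from the others.
    """
    handles = {p: open(p, "rb") for p in paths}
    try:
        pending = [list(paths)]  # candidate classes, still being scanned
        duplicates = []          # finished classes of identical files
        while pending:
            cls = pending.pop()
            if len(cls) < 2:     # a singleton cannot contain duplicates
                continue
            # Group the class members by the content of their next block;
            # files whose blocks differ fall into different groups.
            groups = {}
            for p in cls:
                block = handles[p].read(BLOCK_SIZE)
                groups.setdefault(block, []).append(p)
            if len(groups) == 1:
                # All blocks were equal.  An empty block means EOF, i.e.
                # the files matched all the way through.
                (block,) = groups
                (duplicates if block == b"" else pending).append(cls)
            else:
                # Mismatch: split into equivalence subclasses and keep
                # scanning each one from the current offset onwards.
                pending.extend(groups.values())
        return duplicates
    finally:
        for f in handles.values():
            f.close()

  Note that a class that splits keeps being scanned from its current
offset, so no byte is read twice; the practical caveat is that every
file in a class stays open at once, which can run into the file
descriptor limit for very large classes.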

Roland.
-- 
Roland Mas

You can tune a filesystem, but you can't tuna fish.
  -- in the tunefs(8) manual page.

