
Re: Bug#656142: ITP: duff -- Duplicate file finder



Samuel Thibault, 2012-01-17 12:03:41 +0100:

[...]

> I'm not sure I understand what you mean exactly. If you have even
> just a hundred files of the same size, you will need ten thousand file
> comparisons!

  I'm sure that can be optimised.  Read all 100 files in parallel,
comparing blocks at the same offset.  As long as all the blocks are
identical, each offset costs 99 comparisons; as soon as one of the 99
doesn't match, you can split your set of files at that offset into at
least two equivalence classes, which you then treat as separate subsets.
A subset with only one file drops out of the rest of the scan entirely,
and even if every subset still contains multiple files, each subsequent
step needs at least one comparison fewer.  In the worst case each file
is still read only once, so the total I/O is bounded by the combined
size of the files rather than by the number of pairs.
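
  To make the idea concrete, here is a minimal Python sketch of such a
scan.  It is only an illustration: the function name and block size are
my own choices, and grouping blocks through a dict stands in for the 99
explicit comparisons (it compares each pair of distinct blocks at most
once per offset).  Nothing here is taken from duff's actual code.

BLOCK_SIZE = 64 * 1024  # hypothetical block size; duff may use another


def duplicate_classes(paths):
    """Partition same-sized files into classes of identical content.

    All files are read in parallel, one block per file per offset; a
    class is split as soon as its members' blocks diverge, so a file
    is never read past its first difference from the others.
    """
    handles = {p: open(p, "rb") for p in paths}
    try:
        pending = [list(paths)]  # candidate classes, still being scanned
        duplicates = []          # finished classes of identical files
        while pending:
            cls = pending.pop()
            if len(cls) < 2:     # a singleton cannot contain duplicates
                continue
            # Group the class members by the content of their next block;
            # files whose blocks differ fall into different groups.
            groups = {}
            for p in cls:
                block = handles[p].read(BLOCK_SIZE)
                groups.setdefault(block, []).append(p)
            if len(groups) == 1:
                # All blocks were equal.  An empty block means EOF, i.e.
                # the files matched all the way through.
                (block,) = groups
                (duplicates if block == b"" else pending).append(cls)
            else:
                # Mismatch: split into equivalence subclasses and keep
                # scanning each one from the current offset onwards.
                pending.extend(groups.values())
        return duplicates
    finally:
        for f in handles.values():
            f.close()

  Note that a class that splits keeps being scanned from its current
offset, so no byte is read twice; the practical caveat is that every
file in a class stays open at once, which can run into the file
descriptor limit for very large classes.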

Roland.
-- 
Roland Mas

You can tune a filesystem, but you can't tuna fish.
  -- in the tunefs(8) manual page.

