Re: Non-identical files with identical md5sums on Debian systems?



On Sun, Aug 04, 2013 at 10:24:59PM -0700, Vincent Cheng wrote:
> On Sun, Aug 4, 2013 at 9:44 PM, Fabian Greffrath <fabian@greffrath.com> wrote:
> > I do occasionally check for identical files on different systems by
> > comparing their md5sums. So, just out of interest, could someone tell me
> > (how to find out) how many non-identical files with identical md5sums
> > there are there on a typical (say, amd64) Debian system?
> 
> The closest thing to what you want may be dedup.debian.net, but I
> don't think it lets you filter out non-identical files.

Indeed, this task can be solved with the software backing
dedup.debian.net. The working assumption is that sha512 is
collision-free, so two files with the same md5 hash but different
sha512 hashes are non-identical. Here is a rough outline of how to do it:

1) Obtain the software.
2) Modify schema.sql to add md5 to the function table.
3) Modify importpkg.py to record md5 hashes.
4) Follow the steps in README to import a local Debian mirror.
   (This takes about 7 hours on a fast 8-core box and 3 days on a
   slower single-core machine.)
5) Look for files that have the same md5 hash but different sha512
   hashes. Something like this SQL query should give you an answer
   (untested):

   SELECT h1.cid, h2.cid
     FROM hash AS h1
     JOIN hash AS h2 ON h1.fid = h2.fid AND h1.hash = h2.hash
     JOIN hash AS h3 ON h1.cid = h3.cid
     JOIN hash AS h4 ON h2.cid = h4.cid AND h3.fid = h4.fid
     JOIN function AS f1 ON h1.fid = f1.id
     JOIN function AS f3 ON h3.fid = f3.id
    WHERE h3.hash != h4.hash
      AND f1.name = 'md5'
      AND f3.name = 'sha512';

   The result is pairs of keys (cid) into the content table, which you
   can use to look up the actual filenames and packages.
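
To see what the query in step 5 does without importing a whole mirror, here
is a minimal sketch that runs it against a tiny in-memory SQLite database.
The two-table schema is an assumption inferred only from the column names
used in the query (it is not the real schema.sql), the hash values are
made-up placeholders rather than real digests, and the helper names
build_demo_db and find_md5_collisions are hypothetical.

```python
import sqlite3

# The query from step 5, unchanged apart from layout.
QUERY = """
SELECT h1.cid, h2.cid
  FROM hash AS h1
  JOIN hash AS h2 ON h1.fid = h2.fid AND h1.hash = h2.hash
  JOIN hash AS h3 ON h1.cid = h3.cid
  JOIN hash AS h4 ON h2.cid = h4.cid AND h3.fid = h4.fid
  JOIN function AS f1 ON h1.fid = f1.id
  JOIN function AS f3 ON h3.fid = f3.id
 WHERE h3.hash != h4.hash
   AND f1.name = 'md5'
   AND f3.name = 'sha512'
"""


def build_demo_db():
    """Create a toy database; schema and data are assumptions for illustration."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE hash (cid INTEGER, fid INTEGER, hash TEXT);
        INSERT INTO function (id, name) VALUES (1, 'sha512');
        INSERT INTO function (id, name) VALUES (2, 'md5');
        """
    )
    rows = [
        # (cid, fid, hash): placeholder strings, not real digests.
        (1, 2, "md5-A"), (1, 1, "sha-1"),  # content 1
        (2, 2, "md5-A"), (2, 1, "sha-2"),  # content 2: same md5, different sha512
        (3, 2, "md5-B"), (3, 1, "sha-3"),  # content 3: unrelated file
        (4, 2, "md5-A"), (4, 1, "sha-1"),  # content 4: true duplicate of content 1
    ]
    db.executemany("INSERT INTO hash (cid, fid, hash) VALUES (?, ?, ?)", rows)
    return db


def find_md5_collisions(db):
    """Return sorted unordered cid pairs with equal md5 but different sha512."""
    # The query reports each pair in both orders, so normalise and deduplicate.
    return sorted({tuple(sorted(row)) for row in db.execute(QUERY)})


if __name__ == "__main__":
    for pair in find_md5_collisions(build_demo_db()):
        print(pair)
```

Contents 1 and 4 share both hashes, so they count as identical files and are
not reported; only pairs involving content 2 show up. On a real import the
query emits every pair twice, once in each order, which is why the sketch
collapses them afterwards.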

In case you have any questions, just ask (mail or #-qa on oftc).

Helmut

