Bug#632450: ITP: pmatch -- Duplicate finder and removal tool.

To: Lars Wirzenius <liw@liw.fi>
Cc: 632450@bugs.debian.org
Subject: Bug#632450: ITP: pmatch -- Duplicate finder and removal tool.
From: Tomasz Muras <nexor1984@gmail.com>
Date: Sat, 2 Jul 2011 13:16:36 +0100
Message-id: <CAAVC7SW8cxJWJ_d13Z7dmdJwthcN+RKs2bCYTcyCwHn-uZ=_Kw@mail.gmail.com>
Reply-to: Tomasz Muras <nexor1984@gmail.com>, 632450@bugs.debian.org
In-reply-to: <20110702120334.GA26300@havelock.liw.fi>
References: <20110702102736.4069.73793.reportbug@debian-unstable.beton> <20110702111231.GA22893@havelock.liw.fi> <CAAVC7SUiKWHYwWzMh1JDfepUXbs=vO4--NJ0r8uGbaSo9knqvg@mail.gmail.com> <20110702120334.GA26300@havelock.liw.fi>

Lars,

On Sat, Jul 2, 2011 at 1:03 PM, Lars Wirzenius <liw@liw.fi> wrote:
> It seems you didn't Cc the bug, or debian-devel. Just in case that
> was intentional, I'm not doing it either.

My mistake, thanks for pointing it out.

> On Sat, Jul 02, 2011 at 12:36:51PM +0100, Tomasz Muras wrote:
>> It is different as it (tries to) solve the problem of not just
>> finding the duplicates but also deciding what should be done with them
>> once they are found (e.g. which file should be considered the original
>> and which the duplicate). My original motivation behind first looking
>> for and then creating this utility was cleaning up my photos: imagine
>> thousands of files in hundreds of directories that needed to be cleaned
>> up. I preferred to leave some files in sorted directories, while
>> removing the duplicates from all those "dump", "backup", etc.
>> directories in an automated fashion. And my top priority: I could not
>> allow for any mistakes, so I've put significant effort into testing
>> the tool.
>>
>> The second problem it solves is finding and acting on files that are
>> partial copies of some other, presumably complete, file (e.g. an
>> incomplete FTP download).
>>
>> Before I started working on it I looked for similar utilities and
>> documented them [1]. Also see [2] for other usages.
>>
>> [1] http://pmatch.rubyforge.org/competition.html
>> [2] http://pmatch.rubyforge.org/usage.html
>>
>> I welcome any comments and criticism.
>> Tomek
>
> That does make pmatch seem like a very useful tool! You should add
> some summary of that information from the usage page to your long
> package description.

Agreed. I guess I did a poor job of "advertising" the package.

> Your description said you use a hash to compare files. Is that
> a hash of the complete file? I found, when developing my tool,
> that it's much faster to compare just a little bit of data from
> the beginning of the file, and since my data set had several quite
> large files, this had a big impact. (Obviously, check file size first.)
>
> I quite like your approach of writing out shell commands instead of
> doing any changes directly.
>
> Looking forward to seeing pmatch in Debian.

Agreed again. I'm planning to do more work on pmatch soon, but at the
moment getting it into Debian is my priority.
Comparing just the initial part of each file may be a very good idea,
especially for my use case (photos), as most of the files are of
similar size.
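
For reference, here is a rough sketch of the staged comparison Lars
describes: group by file size first, then by a hash of a small leading
chunk, and only then by a hash of the complete file. This is not
pmatch's actual code. The function names, the 4096-byte chunk size and
the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict

# Assumed size of the leading chunk; not taken from pmatch or Lars's tool.
PREFIX_BYTES = 4096


def _digest(path, limit=None):
    """Hash the first `limit` bytes of a file, or all of it if limit is None."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = 65536 if remaining is None else min(65536, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def _split(groups, key):
    """Regroup each candidate group by `key`, dropping groups of one file."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(g for g in buckets.values() if len(g) > 1)
    return result


def find_duplicates(paths):
    """Return lists of paths whose contents hash identically."""
    groups = [list(paths)]
    groups = _split(groups, os.path.getsize)                      # cheapest test
    groups = _split(groups, lambda p: _digest(p, PREFIX_BYTES))   # leading chunk
    groups = _split(groups, _digest)                              # complete file
    return groups
```

Each stage only reads files that survived the previous one, so the
expensive full-file hash runs on far fewer candidates. When many files
share a size, as with photos, the leading-chunk stage does most of the
filtering.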

Thank you for your review, Lars,
Tomek
