
Bug#632450: ITP: pmatch -- Duplicate finder and removal tool.



Lars,

On Sat, Jul 2, 2011 at 1:03 PM, Lars Wirzenius <liw@liw.fi> wrote:
> It seems you didn't Cc the bug, or debian-devel. Just in case that
> was intentional, I'm not doing it either.

My mistake, thanks for pointing it out.

> On Sat, Jul 02, 2011 at 12:36:51PM +0100, Tomasz Muras wrote:
>> It is different in that it tries to solve the problem not just of
>> finding the duplicates but also of deciding what should be done with
>> them once they are found (e.g. which file should be considered the
>> original and which the duplicate). My original motivation behind first
>> looking for and then creating this utility was cleaning up my photos:
>> imagine thousands of files in hundreds of directories that needed to
>> be cleaned up. I had a preference to leave some files in sorted
>> directories, while removing the duplicates from all those "dump",
>> "backup", etc. ones in an automated fashion. And my top priority: I
>> could not allow for any mistakes, so I've put significant effort into
>> testing the tool.
>>
>> The second problem it solves is finding and acting on files that are
>> partial copies of some other, presumably complete, file (e.g. an
>> incomplete FTP download).
>>
>> Before I started working on it I looked for similar utilities and
>> documented them [1]. Also see [2] for other uses.
>>
>> [1] http://pmatch.rubyforge.org/competition.html
>> [2] http://pmatch.rubyforge.org/usage.html
>>
>> I welcome any comments and criticism.
>> Tomek
>
> That does make pmatch seem like a very useful tool! You should add
> some summary of that information from the usage page to your long
> package description.

Agreed. I guess I did a poor job of "advertising" the package.

> Your description said you use a hash to compare files. Is that
> a hash of the complete file? I found, when developing my tool,
> that it's much faster to compare just a little bit of data from
> the beginning of the file, and since my data set had several quite
> large files, this had a big impact. (Obviously, check file size first.)
>
> I quite like your approach of writing out shell commands instead of
> doing any changes directly.
>
> Looking forward to seeing pmatch in Debian.

Agreed again. I'm planning to do more work on pmatch soon; at the
moment, getting it into Debian is my priority.
Comparing just the beginning of each file may be a very good idea,
especially for my use case (photos), as most of the files are of
similar size.
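
For illustration, the staged comparison suggested above (file size first,
then a little data from the beginning of the file, then a hash of the
complete file) could be sketched roughly as follows in Python. This is not
pmatch's code: the function names, the 4096-byte prefix, and the choice of
SHA-256 are all assumptions made up for this example.

```python
import hashlib
import os
from collections import defaultdict

# Assumed prefix length for the cheap comparison; not a value from this thread.
PREFIX_BYTES = 4096


def _read_prefix(path, n=PREFIX_BYTES):
    """Return the first n bytes of the file (fewer if the file is shorter)."""
    with open(path, "rb") as f:
        return f.read(n)


def _full_hash(path, chunk=1 << 20):
    """Hash the complete file, reading it in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def _split(groups, key):
    """Re-bucket each group by key(path); keep only buckets with 2+ files."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        out.extend(b for b in buckets.values() if len(b) > 1)
    return out


def find_duplicates(paths):
    """Return groups of paths that look identical, cheapest checks first.

    Each stage only examines files that survived the previous one, so the
    full hash is computed only for files that already agree on size and
    on their first PREFIX_BYTES bytes.
    """
    groups = [list(paths)]
    for key in (os.path.getsize, _read_prefix, _full_hash):
        groups = _split(groups, key)
    return [sorted(g) for g in groups]
```

When most files share a size, as with photos from one camera, the size
stage filters out very little and the prefix stage is what avoids reading
every file in full. A cautious tool might still compare the surviving
candidates byte by byte before acting on them.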

Thank you for your review, Lars,
Tomek


