
Bug#662080: ITP: hadori -- Hardlinks identical files



On Sun, Mar 04, 2012 at 07:26:13PM +0100, Timo Weingärtner wrote:
> Hello Julian Andres,
> 
> On 2012-03-04 at 12:31:39 you wrote:
> > But in any case, avoiding yet another tool with the same security
> > issues (CVE-2011-3632) and bugs (and more bugs) as we currently
> > have would be a good idea.
> > 
> > hadori bugs:
> >   - Race, possible data loss: Calls unlink() before link(). If
> >     link() fails the data might be lost (best solution appears
> >     to be to link to a temporary file in the target directory
> >     and then rename to target name, making the replacement
> >     atomic)
> 
> I copied that from ln -f, which would then have the same bug.

Could be. The other way is definitely safer.
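
Roughly, the safer sequence would be the following (just a sketch, not
the actual hadori or hardlink code; the temporary name scheme and error
handling are simplified assumptions):

  #include <cstdio>      // rename()
  #include <string>
  #include <unistd.h>    // link(), unlink()

  // Replace 'target' with a hard link to 'source' without a window in
  // which 'target' is missing: link to a temporary name in the same
  // directory first, then rename() over the target, which replaces the
  // old name atomically.
  bool replace_with_link(const std::string &source, const std::string &target)
  {
      std::string tmp = target + ".tmp-link";     // assumed naming scheme
      if (link(source.c_str(), tmp.c_str()) == -1)
          return false;                           // target left untouched
      if (rename(tmp.c_str(), target.c_str()) == -1) {
          unlink(tmp.c_str());                    // clean up, old target stays
          return false;
      }
      return true;
  }

If link() fails, the original target is still intact; if rename() fails,
only the temporary name is lost, never the target's data.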

> 
> >   - Error checking: Errors when opening files or reading
> >     files are not checked (ifstream uses the failbit and
> >     stuff).
> 
> If only one of the files fails, nothing bad happens. If both fail, bad
> things might happen, that's right.

> 
> > Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
> > 	[Unsafe operations on changing trees]
> >   - If a regular file is replaced by a non-regular one before an
> >     open() for reading, the program reads from a non-regular file
> >   - A source file is replaced by one file with different owner
> >     or permissions after the stat() and before the link()
> >   - A component of the path is replaced by a symbolic link after
> >     the initial stat()ing and readdir()ing. An attacker may use
> >     that to write outside of the intended directory.
> > 
> > (Fixed in Fedora's hardlink, and my hardlink by adding a section
> >  to the manual page stating that it is not safe to run the
> >  program on changing trees).
> 
> I think those kinds of bugs will stay until it is possible to open/link
> by inode number. Perhaps the *at() functions can help with the file
> currently being examined.

Nobody said they have to be fixed. As I wrote, the "fix" was to mention
it in the manpage.
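
That said, the *at() idea plus an fstat() check after opening can at
least narrow some of the windows, at the price of keeping a descriptor
for the directory being processed. A sketch under those assumptions
(again, not what hardlink or hadori actually do):

  #include <fcntl.h>     // openat(), O_RDONLY, O_NOFOLLOW, O_NONBLOCK
  #include <sys/stat.h>  // fstat()
  #include <unistd.h>    // close()

  // Open the entry seen during the earlier readdir()/stat() pass relative
  // to its directory fd, refuse symlinks, and verify that what we opened
  // is still the same regular inode. Returns the fd, or -1 if the file
  // changed in between (the caller should then just skip it).
  int open_checked(int dirfd, const char *name, const struct stat &expected)
  {
      int fd = openat(dirfd, name, O_RDONLY | O_NOFOLLOW | O_NONBLOCK);
      if (fd == -1)
          return -1;
      struct stat now;
      if (fstat(fd, &now) == -1 || !S_ISREG(now.st_mode)
          || now.st_dev != expected.st_dev || now.st_ino != expected.st_ino) {
          close(fd);
          return -1;
      }
      return fd;
  }
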

> 
> Right now I only used it for my backups which are only accessible by me (and 
> root).
> 
> > Possibly hardlink only bugs:
> >    - Exaggeration of sizes. hardlink currently counts every
> >      replaced link as st_size bytes saved, even if st_nlink > 1.
> >      I don't know what hadori does there.
> 
> hadori does not have statistics. They should be easy to add, but I had no use 
> for them.
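
For what it's worth, the size exaggeration mentioned above could be
avoided by only counting st_size when the replaced link was the file's
last one; a hypothetical helper:

  #include <cstdint>
  #include <sys/stat.h>

  // Only when the replaced link was the last one (st_nlink == 1) does
  // the file's data actually become freeable, so only then add st_size
  // to the "saved" counter. Hypothetical, not existing code.
  void count_saved(const struct stat &replaced, std::uint64_t &saved_bytes)
  {
      if (replaced.st_nlink == 1)
          saved_bytes += static_cast<std::uint64_t>(replaced.st_size);
  }
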
> 
> > You can also drop your race check. The tool is unsafe on
> > changing trees anyway, so you don't need to check whether
> > someone else deleted the file, especially if you're then
> > linking to it anyway.
> 
> I wanted it to exit when something unexpected happens.

But then there are other cases where you don't exit, such as
errors when opening files. And also, users will complain if you exit
just because one file has a problem. They expect the
program to continue with the remaining ones (at least they
expected this in my audio conversion script, so I do this
in hardlink as well).

> 
> > I knew that there were problems on large trees in 2009, but got nowhere with
> > a fix in Python. We still have the two passes in hardlink and thus currently
> > need to keep all the files in memory, because I did not carry the link-first
> > mode over from my temporary C++ rewrite: memory usage was not much different
> > in my test case. But as my test case was just running on /, the whole thing
> > may not be representative. If there are lots of duplicates, link-first can
> > definitely help.
> > 
> > The one that works exactly as you want is most likely Fedora's hardlink.
> 
> I've searched for other implementations and all the others do two passes when 
> one is obviously enough.

Fedora's hardlink should do one pass only and link the first file to later
files. It's fairly simple code. But nobody apart from Fedora and some
other RPM distributions uses it.
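
The link-first idea boils down to something like the following sketch
(not Fedora's actual code; hash_contents() is an assumed content-hash
helper, replace_with_link() is the helper sketched earlier, and real
code would also compare owner, mode and mtime before linking):

  #include <string>
  #include <sys/stat.h>
  #include <unordered_map>
  #include <vector>

  std::string hash_contents(const std::string &path);                     // assumed
  bool replace_with_link(const std::string &src, const std::string &dst); // see above

  // One pass: remember the first path seen for each (size, content hash)
  // key and link every later duplicate to it, so only the map has to be
  // kept in memory.
  void link_first(const std::vector<std::string> &files)
  {
      std::unordered_map<std::string, std::string> first_seen;
      for (const std::string &path : files) {
          struct stat st;
          if (lstat(path.c_str(), &st) == -1 || !S_ISREG(st.st_mode))
              continue;
          std::string key = std::to_string(st.st_size) + '/' + hash_contents(path);
          auto ins = first_seen.emplace(key, path);
          if (!ins.second)                                  // key already known:
              replace_with_link(ins.first->second, path);   // link to the first file
      }
  }

A real implementation would of course not hash files whose size is
unique, and would compare the actual contents before linking.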

> 
> > Yes. It looks readable, but also has far fewer features than hardlink (which
> > were added to hardlink because of user requests).
> 
> I still don't get what --maximize (and --minimize) are needed for. In my
> incremental full backup scenario I get the best results with keep-first. When
> hardlinking only $last and $dest (see below), even --maximize can disconnect
> files from older backups.

That's to be expected. I don't know the reasons either. I thought
they came from hardlinkpy (of which hardlink was a rewrite with
the goal of increasing object-orientedness and readability). But
it seems they didn't. It turns out that this was a user request,
or actually a feature of that user's fork of hardlinkpy:

  **** LOG STARTED AT Fri Dec 19 15:48:39 2008

  Dec 19 18:54:56 <chf> If appropriate, the two should be merged, because
                        the modification that I need also compares the
                        link count of the files and always keeps the one
                        with the higher count, replacing the one with
                        the lower count by a hard link to the other.

(The quote above is translated from the original German log.)

I think he probably ran this on all of his backups at once. And it
makes sense if you run it on a single directory as well. The
option that does not make sense at all is --minimize, though;
I don't know why anyone would want it. Historical garbage, in
my opinion.
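
For reference, the rule from the quoted request amounts to keeping the
inode with the higher link count (again just a sketch, reusing the
replace_with_link() helper from above):

  #include <string>
  #include <sys/stat.h>

  bool replace_with_link(const std::string &src, const std::string &dst); // see above

  // --maximize as described in the quote: between two identical files,
  // keep the one whose inode already has more links and turn the other
  // name into a hard link to it.
  void maximize_links(const std::string &a, const struct stat &sa,
                      const std::string &b, const struct stat &sb)
  {
      if (sa.st_nlink >= sb.st_nlink)
          replace_with_link(a, b);   // b now points at a's inode
      else
          replace_with_link(b, a);
  }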


> 
> > > It
> > > started with tree based map and multimap, now it uses the unordered_
> > > (hash based) versions which made it twice as fast in a typical workload.
> > 
> > That's strange. In my (not published) C++ version of hardlink, unordered
> > (multi) maps were only slightly faster than ordered ones. I then rewrote
> > the code in C to make it more readable to the common DD who does not
> > want to work with C++, and more portable.
> > 
> > And it does not seem plausible to spend so much time in the map, at
> > least not without caching. And normally, you most likely do not have
> > the tree(s) you're hardlinking cached.
> 
> I have, because I usually run:
> $ rsync -aH $source $dest --link-dest $last
> $ hadori $last $dest

OK, but then you either have large amounts of RAM or not
much to compare. If I read one million files, my cache is
practically useless afterwards and only covers some
of them; most have to be re-read.

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.


