Re: Safe File Update (atomic)
On Sun, Jan 02, 2011 at 03:14:41PM -0200, Henrique de Moraes Holschuh wrote:
> 1. Create unlinked file fd (benefits from kernel support, but doesn't
> require it). If a filesystem cannot support this or the boundary conditions
> are unacceptable, fail. Needs to know the destination name to do the unlinked
> create on the right fs and directory (otherwise attempts to link the file
> later would have to fail if the fs is different).
This is possible. It would be specific only to file systems that
support inodes (i.e., ix-nay for NFS, FAT, etc.). Some file systems
would want to know a likely directory where the file would be linked,
so that their inode and block allocation policies can optimize the
inode and block placement.
> 2. fd works as any normal fd to an unlinked regular file.
> 3. create a link() that can do unlink+link atomically. Maybe this already
> exists, otherwise needs kernel support.
> The behaviour of (3) should allow synchronous wait of a fsync() and a sync of
> the metadata of the parent dir. It doesn't matter much if it does
> everything, or just calling fsync(), or creating a fclose() variant that
> does it.
OK, so this is where things get tricky. The first issue is that you
are asking for the ability to take a file descriptor and link it into
some directory. The inode associated with the fd might or might not
already be linked into some other directory, and it might or might not
be owned by the user trying to do the link. The latter could get
problematic if quota is enabled, since it does open up a new
potential security exposure.
A user might pass a file descriptor to another process in a different
security domain, and that process could create a link to some
directory which the original user doesn't have access to. The user
would no longer be able to delete the file and drop the quota usage,
and the process
would retain permanent access to the file, which it might not
otherwise have if the inode was protected by a parent directory's
permissions. It's for the same reason that we can't just implement
open-by-inode-number; even if you use the inode's permissions and
ACLs to do an access check, this allows someone to bypass security
controls based on the containing directory's permissions. It might
not be a security exposure, but for some scenarios (i.e., a mode 600
~/Private directory that contains world-readable files), it changes
accessibility of some files.
We could control for this by only allowing the link to happen if the
user executing this new system call owns the inode being linked, so
this particular problem is addressable.
The larger problem is that this doesn't give you any performance
benefit over simply creating a temporary file, fsync'ing it, and then
doing the rename. And it doesn't solve the problem that userspace is
responsible for copying over the extended attributes and ACL
information. So in exchange for doing something non-portable which is
Linux specific, and won't work on FAT, NFS, and other non-inode based
file systems at all, and which requires special file-system
modifications for inode-based file systems --- the only real benefit
you get is that the temp file gets cleaned up automatically if you
crash before the link/unlink new magical system call is completed.
Is it worth it? I'm not at all convinced.
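For concreteness, the xattr and ACL copying that userspace is left responsible for looks roughly like the following sketch (the helper name is mine; on Linux, POSIX ACLs are stored in system.* extended attributes, so copying xattrs covers them too):

```python
import os
import stat

def copy_mode_and_xattrs(src: str, dst: str) -> None:
    # The bookkeeping userspace must do by hand in the
    # write-temp-file / fsync / rename scheme: carry over the
    # permission bits and every extended attribute (POSIX ACLs
    # ride along as system.posix_acl_* xattrs on Linux).
    st = os.stat(src)
    os.chmod(dst, stat.S_IMODE(st.st_mode))
    for name in os.listxattr(src):
        os.setxattr(dst, name, os.getxattr(src, name))
```

Note what it cannot do: the file's ownership can only be carried over by root, which is why the "file owned by a different uid" scenario below is special.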
Can this be fixed? Well, I suppose we could have this magical
link/unlink system call also magically copy over the xattrs and ACLs.
And if you don't care about when things happen, you could have the
kernel fork off a kernel thread, which does the fsync, followed by the
magic ACL and xattr copying, and once all of this completes, it could
do the magic link/unlink.
So we could bundle all of this into a system call. *Theoretically*.
But then someone else will say that they want to know when this magic
link/unlink system call actually completes. Others might say that
they don't care about the fsync happening right away, but would rather
wait some arbitrary time, and let the system writeback algorithms write
back the file *whenever* --- and only once the file has been written
back *whenever* should the rest of the magical link/unlink happen.
So now we have an explosion of complexity, with all sorts of different
variants. And there's also the problem that if you don't make the
system call synchronous (where it does an fsync() and waits for it to
complete), you'll lose the ability to report errors back to the
application.
Which gets me back to the question of use cases. When are we going to
be using this monster? For many of the use cases where we originally
said people were doing it wrong, the risk was losing data. But if you
don't do things synchronously, with an fsync(), you'll also end up
risking data loss, because you won't know about write failures ---
specifically, your program may have long since exited by the time the
write failure is noticed by the kernel. But if you do make the system
call synchronous, now
there's no performance advantage over simply doing the fsync() and
rename() in userspace. And if we do this using O_ATOMIC, or your
scheme with unlinked file descriptors and the magic link/unlink by fd
system call, it means the application programmers have to modify their
programs anyway --- so why not modify them to use the userspace
library that does safe writing?
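A minimal sketch of what such a userspace safe-write helper does (the function name is mine, and a real library would also handle the xattr/ACL copying and error paths more carefully):

```python
import os

def safe_replace(path: str, data: bytes) -> None:
    # The userspace scheme this message keeps comparing against:
    # write a temp file in the same directory, fsync it, rename it
    # over the target, then fsync the directory so the rename itself
    # is durable.
    d = os.path.dirname(path) or "."
    tmp = os.path.join(d, f".{os.path.basename(path)}.tmp.{os.getpid()}")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)   # write errors surface here, while we can still report them
    finally:
        os.close(fd)
    os.rename(tmp, path)   # atomic replacement on POSIX filesystems
    dfd = os.open(d, os.O_RDONLY)
    try:
        os.fsync(dfd)      # persist the directory entry update
    finally:
        os.close(dfd)
```

Note that if the process dies between the write and the rename, the temp file is left behind --- which is exactly scenario (3) below.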
So is all of this effort really worth it at the end of the day?
When you sum it all up, the only way that it makes sense is if one of
the following applies:
1) You care about data loss in the case of power failure, but not in
the case of hard drive or storage failure, *AND* you are writing tons
and tons of tiny 3-4 byte files, so you are worried about performance
because you're doing something insane with large numbers of small
files.
2) You are specifically worried about the case where you are replacing
the contents of a file that is owned by different uid than the user
doing the data file update, in a safe way where you don't want a
partially written file to replace the old, complete file, *AND* you
care about the file's ownership after the data update.
3) You care about the temp file used by the userspace library, or
application which is doing the write temp file, fsync(), rename()
scheme, being automatically deleted in case of a system crash or a
process getting sent an uncatchable signal and getting terminated.
Against these possible scenarios where some new kernel code might be a
win, you have to weigh:
A) Lack of OS portability to other POSIX operating systems: Mac OS X,
Solaris, FreeBSD, AIX, etc.
B) Lack of portability for file systems that don't use inodes as a
basis for their design.
C) Lack of portability for file systems that haven't been hacked to
support this new scheme, even if they are inode-based.
Is it worth it? I'd say no; and suggest that someone who really cares
should create a userspace application helper library first, since
you'll need it as a fallback for the cases listed above where this
scheme won't work. (Even if you do the fallback in the kernel, you'll
still need a userspace fallback for non-Linux systems, and for when
the application is run on an older Linux kernel that doesn't have all
of this O_ATOMIC or link/unlink magic.)
The scheme you suggested is certainly *technically feasible* in terms
of something that could be implemented. Whether or not it would be
worth it, given that portable applications won't be able to count on
it being present, is a very different question entirely.
The reality is we've lived without this capability in Unix and Linux
systems for something like three decades. I suspect we can live
without it for the next couple of decades without it being the end of
the world.