Re: Safe File Update (atomic)
On Sun, Jan 2, 2011 at 8:09 AM, Ted Ts'o <firstname.lastname@example.org> wrote:
>> You could ask for a new (non-POSIX?) API that does not ask of a
>> POSIX-like filesystem something it cannot provide (i.e. don't ask for
>> something that requires inode->path reverse mappings). You could ask
>> for syscalls to copy inodes, etc. You could ask for whatever is needed
>> to do a (open+write+close) that is atomic if the target already exists.
>> Maybe one of those has a better chance than O_ATOMIC.
> The O_ATOMIC open flag is highly problematic, and it's not fully
> specified. What if the system is under a huge amount of memory
> pressure, and the badly behaved application program does:
> fd = open("file", O_WRONLY | O_ATOMIC | O_TRUNC);
> write(fd, buf, 2UL*1024*1024*1024); // write 2 gigs, heh, heh heh
> <sleep for one day>
> write(fd, buf2, 1024);
Last time you ignored my response, but let's try again.
The implementation would be comparable to using a temp file, so
there's no need to keep 2 GB in memory.
Write the 2 GB to disk, wait one day, append the 1 KB, fsync, update inode.
> What happens if another program opens "file" for reading during the
> one day sleep period? Does it get the old contents of "file"?
Of course, according to the definition of atomic.
> The partially written, incomplete new version of "file"? What happens
> if the file is currently mmap'ed, as Henrique has asked?
Didn't I respond to that too? Again, old file.
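For comparison, here is a minimal sketch of what a reader sees today when a
file is replaced with rename(): a descriptor opened before the rename keeps
referring to the old inode, so it reads the old contents. This only
illustrates the existing rename behavior that the proposal is being compared
to; O_ATOMIC itself does not exist. put_file() and read_across_replace() are
hypothetical names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create or truncate `path` and write `text` to it. */
static int put_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t w = write(fd, text, len);
    if (close(fd) < 0 || w != (ssize_t)len)
        return -1;
    return 0;
}

/* Hypothetical demo: open `path` for reading, replace it via rename(),
 * then read through the descriptor that was opened before the replace.
 * Copies what the reader saw into `seen`; returns bytes read or -1. */
ssize_t read_across_replace(const char *path, const char *tmp,
                            const char *old_text, const char *new_text,
                            char *seen, size_t cap)
{
    if (put_file(path, old_text) < 0)
        return -1;

    int reader = open(path, O_RDONLY);   /* reader opens the old version */
    if (reader < 0)
        return -1;

    /* Writer prepares the new version elsewhere and renames it into place. */
    if (put_file(tmp, new_text) < 0 || rename(tmp, path) < 0) {
        close(reader);
        return -1;
    }

    ssize_t n = read(reader, seen, cap); /* descriptor still refers to the old inode */
    close(reader);
    return n;
}
```

A new open() of the path after the rename would see the new contents; only
descriptors (and mappings) that predate the rename keep the old file.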
> What if another program opens the file O_ATOMIC during the one day
> sleep period, so the file is in the middle of getting updated by two
> different processes using O_ATOMIC?
Again equivalent to using the rename trick. One of the updates will
win and since they don't depend on the old contents there are no
> How exactly do the semantics for O_ATOMIC work?
> And given at the moment ***zero*** file systems implement O_ATOMIC,
> what should an application do as a fallback? And given that it is
Fallback could be implemented in the kernel or in userland. Using rename
as a fallback sounds reasonable. Implementations could switch to
O_ATOMIC when available.
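A rough sketch of what such a userland fallback could look like, using a
temp file, fsync() and rename(). This is an illustration under assumptions,
not an existing library: safe_write() is a hypothetical name, and the
".XXXXXX" temp-file suffix next to the target is a choice made for the
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf` so that
 * readers see either the old or the new contents, never a partial file.
 * Returns 0 on success, -1 on error with errno set. */
int safe_write(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Temp file in the same directory, so rename() stays on one file system. */
    int fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    if (fsync(fd) < 0)          /* data must be on disk before the rename */
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    if (rename(tmp, path) < 0)  /* atomically switch the name to the new file */
        goto fail;
    return 0;

fail:
    {
        int saved = errno;
        if (fd >= 0)
            close(fd);
        unlink(tmp);
        errno = saved;
    }
    return -1;
}
```

Note what this sketch leaves out: it does not fsync() the containing
directory, and the new file gets the owner and 0600 mode of the temp file
rather than those of the file it replaces.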
> highly unlikely this could ever be implemented for various file
> systems including NFS, I'll observe this won't really reduce
> application complexity, since you'll always need to have a fallback
> for file systems and kernels that don't support O_ATOMIC.
I don't see a reason why this couldn't be implemented by NFS.
> And what are the use cases where this really makes sense? Will people
Lots of files are written in one go. They could all use this interface.
> really code to this interface, knowing that it only works on Linux
> (there are other operating systems, out there, like FreeBSD and
FreeBSD, Solaris and AIX probably also care about file consistency.
Discussing this proposal with them would be a good idea.
> Solaris and AIX, you know, and some application programmers _do_ care
> about portability), and the only benefits are (a) a marginal
> performance boost for insane people who like to write vast number of
> 2-4 byte files without any need for atomic updates across a large
> number of these small files, and (b) the ability to keep the file
> owner unchanged when someone other than the owner updates said file
> (how important is this _really_; what is the use case where this
> really matters?).
As you've said yourself, a lot of apps don't get this right. Why not?
Because the safe way is much more complex than the unsafe way. APIs
should be easy to use right and hard to misuse. With O_ATOMIC, I feel
this is the case. Without, it's the opposite and the consequences are
obvious. There shouldn't be a tradeoff between safety and potential
O_ATOMIC is merely a proposed way to solve this problem. I've asked
(you) for a concrete code example to do it without O_ATOMIC support,
but nobody has been able to provide one yet.
> And of course, Olaf isn't actually offering to implement this
> hypothetical O_ATOMIC. Oh, no! He's just petulantly demanding it,
> even though he can't give us any concrete use cases where this would
> actually be a huge win over a userspace "safe-write" library that
> properly uses fsync() and rename().
Not true. I've asked (you) for just such a lib, but I'm still waiting
for an answer.
Why would anyone work on an implementation if there's no agreement about it?