
Re: Safe File Update (atomic)

On Sun, Jan 02, 2011 at 04:14:15PM +0100, Olaf van der Spek wrote:
> On Sun, Jan 2, 2011 at 8:09 AM, Ted Ts'o <tytso@mit.edu> wrote:
> > The O_ATOMIC open flag is highly problematic, and it's not fully
> > specified.

Note that on the other side of the fence there's something called TxF
(Transactional NTFS).  I don't know how fast or reliable it is, but 
browsing the docs shows some interesting things.  In particular, it is
not limited to a single file but can handle any number of changes to the

> > What if the system is under a huge amount of memory
> > pressure, and the badly behaved application program does:
> >
> >        fd = open("file", O_ATOMIC | O_TRUNC);
> >        write(fd, buf, 2*1024*1024*1024); // write 2 gigs, heh, heh heh
> >        <sleep for one day>
> >        write(fd, buf2, 1024);
> >        close(fd);
> Last time you ignored my response, but let's try again.
> The implementation would be comparable to using a temp file, so
> there's no need to keep 2 g in memory.
> Write the 2 g to disk, wait one day, append the 1 k, fsync, update inode.

And what if you're changing one byte inside a 50 GB file?
I see an easy implementation on btrfs/ocfs2 (assuming no other writers),
but on ext3/4, that'd be tricky.

> > What if another program opens the file O_ATOMIC during the one day
> > sleep period, so the file is in the middle of getting updated by two
> > different processes using O_ATOMIC?
> Again equivalent to using the rename trick. One of the updates will
> win and since they don't depend on the old contents there are no
> troubles.

On NTFS, an attempt to open a file for writing a second time fails if either
you or the other writer uses TxF.  This goes against the usual Unix
semantics (where you can always open the file for writing), but it is how SQL
works.  NTFS has bad lock granularity (the whole file rather than a row,
page or a byte range), but it is straightforward.

> > How exactly do the semantics for O_ATOMIC work?
> >
> > And given at the moment ***zero*** file systems implement O_ATOMIC,

I'd count TxF as an implementation.

> > what should an application do as a fallback?  And given that it is
> Fallback could be implemented in the kernel or in userland. Using rename
> as a fallback sounds reasonable. Implementations could switch to
> O_ATOMIC when available.

For large files, using reflink (currently implemented as fs-specific ioctls)
can ensure performance.  It gives you everything except the owner-preserving
hack (i.e., the topic of this thread).  To get that, you'd need
in-kernel support, but for example http://lwn.net/Articles/331808/ proposes
an API which is just a thin wrapper over existing functionality in multiple
filesystems.  It basically duplicates an inode, preserving all current
attributes but making any new writes CoWed.  If you make the old one
immutable, you get the TxF semantics (mandatory write lock); if you don't,
you'll get the "one of the updates will win" data loss mentioned above.
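
To make the reflink idea concrete, here is a minimal sketch of the "change
one byte inside a huge file" case.  It assumes a Linux system whose kernel
headers provide the generic FICLONE ioctl in <linux/fs.h> (newer than the
fs-specific ioctls referred to above) and a filesystem that supports
cloning; everywhere else it simply fails.  reflink_patch() and its
parameters are hypothetical names chosen for illustration, not an existing
API.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* FICLONE, where the kernel headers provide it */
#endif

/*
 * Hypothetical helper: clone `path` into `tmp_path` (sharing data blocks
 * copy-on-write), patch `len` bytes at `offset` in the clone, then rename
 * the clone over the original.  Returns 0 on success, -1 with errno set.
 * Assumption: both paths live on the same reflink-capable filesystem.
 */
int reflink_patch(const char *path, const char *tmp_path,
                  off_t offset, const void *buf, size_t len)
{
#ifdef FICLONE
    int saved;
    int src = open(path, O_RDONLY);
    if (src < 0)
        return -1;

    int dst = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (dst < 0) {
        saved = errno;
        close(src);
        errno = saved;
        return -1;
    }

    /* Share all blocks of src with dst; fails if cloning is unsupported. */
    if (ioctl(dst, FICLONE, src) < 0)
        goto fail;
    /* Only the blocks touched here get copied. */
    if (pwrite(dst, buf, len, offset) != (ssize_t)len)
        goto fail;
    if (fsync(dst) < 0)
        goto fail;

    close(src);
    if (close(dst) < 0 || rename(tmp_path, path) < 0) {
        saved = errno;
        unlink(tmp_path);
        errno = saved;
        return -1;
    }
    return 0;

fail:
    saved = errno;
    close(src);
    close(dst);
    unlink(tmp_path);
    errno = saved;
    return -1;
#else
    (void)path; (void)tmp_path; (void)offset; (void)buf; (void)len;
    errno = ENOTSUP;   /* no generic clone ioctl on this platform */
    return -1;
#endif
}
```

Note that the clone is created by, and therefore owned by, the caller, so
this sketch has the same owner-changing behaviour as the plain rename trick,
and two concurrent callers still race with "one of the updates will win".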

> > highly unlikely this could ever be implemented for various file
> > systems including NFS, I'll observe this won't really reduce
> > application complexity, since you'll always need to have a fallback
> > for file systems and kernels that don't support O_ATOMIC.
> I don't see a reason why this couldn't be implemented by NFS.

Not sure how extensible NFS is, but it's just a matter of passing these
calls over network to the underlying filesystem.  Ie, the problem can be
divided into doing this locally (see above) and extending NFS.

> > And what are the use cases where this really makes sense?  Will people
> Lots of files are written in one go. They could all use this interface.

I don't see how O_ATOMIC helps there.  TxF transactions would work (all
writes either succeed together or none does), but O_ATOMIC can't do more
than one file.

> > the only benefits are (a) a marginal performance boost for insane people
> > who like to write vast number of 2-4 byte files without any need for
> > atomic updates across a large number of these small files, and (b) the
> > ability to keep the file owner unchanged when someone other than the
> > owner updates said file (how important is this _really_; what is the use
> > case where this really matters?).
> As you've said yourself, a lot of apps don't get this right. Why not?
> Because the safe way is much more complex than the unsafe way. APIs
> should be easy to use right and hard to misuse. With O_ATOMIC, I feel
> this is the case. Without, it's the opposite and the consequences are
> obvious. There shouldn't be a tradeoff between safety and potential
> problems.

Uhm, but you didn't answer the question.  The two use cases Ted Ts'o
mentioned are certainly not worth the complexity of in-kernel support,
O_ATOMIC doesn't bring any other goodies, and the rest can be done by a
userspace library, which is indeed a good idea.

> O_ATOMIC is merely a proposed way to solve this problem. I've asked
> (you) for a concrete code example to do it without O_ATOMIC support,
> but nobody has been able to provide one yet.

Preserving the owner of a file you write to as non-root... not an important
issue; the rest has already been explained elsewhere in this thread.

> > And of course, Olaf isn't actually offering to implement this
> > hypothetical O_ATOMIC.  Oh, no!  He's just petulantly demanding it,
> > even though he can't give us any concrete use cases where this would
> > actually be a huge win over a userspace "safe-write" library that
> > properly uses fsync() and rename().

Especially if that library can be changed to use barrier()+rename() when/if
it finally becomes available.  Having to add this in just one place would
be nice.
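
For reference, the fsync()+rename() approach such a library would wrap can
be sketched roughly as follows.  This is only an illustration under stated
assumptions, not the library being discussed: safe_write() and write_all()
are hypothetical names, the temp file is created next to the target so that
rename() stays on one filesystem, and EINVAL from fsync() on the directory
is tolerated on the assumption that some filesystems do not support it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: replace the contents of `path` with `buf` so that
 * readers see either the old or the new file, never a partial one.
 * Returns 0 on success, -1 with errno set.
 */
int safe_write(const char *path, const void *buf, size_t len)
{
    char tmp[PATH_MAX], dir[PATH_MAX];
    int saved;

    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = mkstemp(tmp);          /* new file, mode 0600, owned by caller */
    if (fd < 0)
        return -1;

    /* Data must be on disk before the rename makes it visible. */
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        saved = errno;
        close(fd);
        unlink(tmp);
        errno = saved;
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, path) < 0) {
        saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }

    /* Flush the directory entry so the rename itself survives a crash. */
    snprintf(dir, sizeof dir, "%s", path);
    int dfd = open(dirname(dir), O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    saved = errno;
    close(dfd);
    if (rc < 0 && saved == EINVAL)
        rc = 0;                     /* assumption: directory fsync unsupported */
    errno = saved;
    return rc;
}
```

A caller would use it as safe_write("config", data, size).  Because the
replacement is a brand-new inode created by the caller, the original owner
and mode are not preserved, which is exactly the limitation argued about
above.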

> Not true. I've asked (you) for just such a lib, but I'm still waiting
> for an answer.

Shachar Shemesh is already working on it; when he finishes, Ted Ts'o will
point out what's wrong with it (if anything is).  What else do you need?

> Why would anyone work on an implementation if there's no agreement about it?

Because one implementation after research is better than many naive and
possibly wrong ones.

1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
