Re: Safe File Update (atomic)
On Wed, Jan 5, 2011 at 7:26 PM, Ted Ts'o <firstname.lastname@example.org> wrote:
> On Wed, Jan 05, 2011 at 12:55:22PM +0100, Olaf van der Spek wrote:
>> > If you give me a specific approach, I can tell you why it won't work,
>> > or why it won't be accepted by the kernel maintainers (for example,
>> > because it involves pouring far too much complexity into the kernel).
>> Let's consider the temp file workaround, since a lot of existing apps
>> use it. A request is to commit the source data before committing the
>> rename. Seems quite simple.
> Currently ext4 is initiating writeback on the source file at the time
> of the rename. Given performance measurements others (maybe it was
> you, I can't remember, and I don't feel like going through the
> literally hundreds of messages on this and related threads) have
> cited, it seems that btrfs is doing something similar. The problem
> with doing a full commit, which means surviving a power failure, is
> that you have to request a barrier operation to make sure the data
> goes all the way down to the disk platter --- and this is expensive
> (on the order of at least 20-30ms, more if you've written a lot to the
> We have had experience with forcing data writeback (what you call
> "commit the source data") before the rename --- ext3 did that. And it
> had some very nasty performance problems which showed up on very busy
> systems where people were doing a lot of different things at the same
> time: large background writes from bittorrents and/or DVD ripping,
> compiles, web browsing, etc. If you force a large amount of data out
> when you do a commit, everything else that tries to write to the file
> system at that point stops, and if you have stupid programs (i.e.,
> firefox trying to do database updates on its UI loop), it can cause
> programs to apparently lock up, and users get really upset.
I'm not sure why other IO would be affected. Isn't this equivalent to
fsync on the source file?
It almost sounds like you lock the entire FS during the data
writeback, which shouldn't be necessary.
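For reference, the temp file workaround with an explicit fsync on the
source file can be sketched roughly as below. This is only a minimal
userspace sketch: the helper name atomic_update and the ".tmp-" prefix
are made up for illustration, and it says nothing about what the file
system does internally at rename time.

```python
import os
import tempfile


def atomic_update(path, data):
    """Replace the contents of `path` with `data` (bytes) via a temp file.

    Hypothetical helper for illustration; not taken from any existing app.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so that the rename
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to commit the data
        # Atomically put the new file in place of the old one.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, drop the temp file and leave the target untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that mkstemp creates the temp file with mode 0600, so a real
application would probably also want to copy the permissions and
ownership of the old file before the rename.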
> So one of the questions is how much we should be penalizing programs that
> are doing things right (i.e., using fsync), versus programs which are
> doing things wrong (i.e., using rename and trusting to luck). This is
> a policy question, for which you might have a different opinion than I
> might have on the subject.
> We could also simply force a synchronous data writeback at rename
> time, instead of merely starting writeback at the point of the rename.
> In the case of a program which has already done an fsync(), the
> synchronous data writeback would be a no-op, so that's good in terms
> of not penalizing programs which do things right. But the problem
> there is that there could be some renames where forcing data writeback
> is not needed, and so we would be forcing the performance hit of the
> "commit the source data" even when it might not be needed (or wanted)
> by the user.
> How often does it happen that someone does a rename on top of an
> already-existing file, where the fsync() isn't wanted? Well, I can
> think up scenarios, such as where an existing .iso image is corrupted
> or needs to be updated, and so the user creates a new one and then
> renames it on top of the old .iso image, but then gets surprised when
> the rename ends up taking minutes to complete. Is that a common
Would this be an example of an atomic non-durable use case? ;)
I thought those didn't exist?
> occurrence? Probably not, but the case of the system crashing right
> after the rename() is somewhat unusual as well.
Given the reports of empty files, that's not that unusual.
The delay in this unusual case seems like a small price to pay.
> For this reason, I'm cautious about going overboard at forcing commits
> on renames; doing this has real performance implications, and it is a
> computer science truism that optimizing for the uncommon/failure case
> is a bad thing to do.
Performance is important, I agree.
But you're trading performance for safety here.
And on rename, you have to guess the user's intention: just rename or
atomic file update.
> OK, what about simply deferring the commit of the rename until the
> file writeback has naturally completed? The problem with that is
> "entangled updates". Suppose there is another file which is written
> to the same directory block as the one affected by the rename, and
> *that* file is fsync()'ed? Keeping track of all of the data
> dependencies is **hard**. See: http://lwn.net/Articles/339337/
Ah. So performance isn't the problem, it's just hard to implement.
Would've been a lot faster if you said that earlier.
Instead, you require apps to use fsync, even if they don't need/want
it, which introduces a performance hit.
Wasn't there a big problem with fsync in ext3 anyway?
BTW, with O_ATOMIC, you could avoid the updates to directory blocks
and would only have to track other updates to the same inode.
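To make the intended semantics concrete: O_ATOMIC is only a proposal
and no such open flag exists in Linux today. From the application's
point of view it would behave roughly like the userspace emulation
below, which is a sketch under that assumption. The name atomic_open
is hypothetical, and the emulation still goes through a temp file and
a rename, which is exactly what a real O_ATOMIC would avoid.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_open(path):
    """Hypothetical userspace stand-in for the proposed O_ATOMIC flag.

    Writes go to a temp file. The target is replaced only if the block
    finishes without an exception. No fsync is issued, so the update is
    atomic with respect to other readers but not guaranteed durable.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    tmp = os.fdopen(fd, "wb")
    try:
        yield tmp
        tmp.close()                  # flush buffered data to the kernel
        os.replace(tmp_path, path)   # commit: new contents become visible
    except BaseException:
        tmp.close()
        # Abort: discard the temp file, the old contents stay in place.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Usage would be "with atomic_open(name) as f: f.write(data)". Without
the fsync, what survives a crash right after the rename still depends
on what the file system does at rename time.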
>> > But for me to list all possible approaches and tell you why each one
>> > is not going to work? You'll have to pay me before I'm willing to
>> > invest that kind of time.
>> That's not what I asked.
> Actually, it is, although maybe you didn't realize it. Look above,
> and how I had to present multiple alternatives, and then shoot them
> all down, one at a time. There are hundreds of solutions, all of them
IMO the ideal solution (from a performance point of view) is deferring
the rename. Saying the ordering would be hard to implement was easy.
> Hence why *my* counter is --- submit patches. The mere act of
> actually trying to code an alternative will allow you to determine why
> your approach won't work, or failing that, others can take your patches,
> apply them, and then demonstrate use cases where your idea completely
> falls apart. But it means that you do most of the work, which is fair
> since you're the one demanding the feature.
That's an easy counter. Not unreasonable from you, but others have
said the same.
It's quite easy to say performance would suffer but then refuse to
back up that claim.
However, writing such patches is much easier said than done, even if
your ideas are valid.