Re: Safe File Update (atomic)
On Wed, Jan 05, 2011 at 12:55:22PM +0100, Olaf van der Spek wrote:
> > If you give me a specific approach, I can tell you why it won't work,
> > or why it won't be accepted by the kernel maintainers (for example,
> > because it involves pouring far too much complexity into the kernel).
> Let's consider the temp file workaround, since a lot of existing apps
> use it. A request is to commit the source data before committing the
> rename. Seems quite simple.
Currently ext4 is initiating writeback on the source file at the time
of the rename. Given performance measurements others (maybe it was
you, I can't remember, and I don't feel like going through the
literally hundreds of messages on this and related threads) have
cited, it seems that btrfs is doing something similar. The problem
with doing a full commit, which means surviving a power failure, is
that you have to request a barrier operation to make sure the data
goes all the way down to the disk platter --- and this is expensive
(on the order of at least 20-30ms, more if you've written a lot to the

We have had experience with forcing data writeback (what you call
"commit the source data") before the rename --- ext3 did that. And it
had some very nasty performance problems which showed up on very busy
systems where people were doing a lot of different things at the same
time: large background writes from bittorrents and/or DVD ripping,
compiles, web browsing, etc. If you force a large amount of data out
when you do a commit, everything else that tries to write to the file
system at that point stops, and if you have stupid programs (i.e.,
firefox trying to do database updates on its UI loop), it can cause
programs to apparently lock up, and users get really upset.

So one of the questions is how much we should be penalizing programs that
are doing things right (i.e., using fsync), versus programs which are
doing things wrong (i.e., using rename and trusting to luck). This is
a policy question, for which you might have a different opinion than I
might have on the subject.
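
For readers who want the "doing things right" pattern spelled out, here is a minimal sketch of write-temp-file, fsync, then rename. It is only an illustration under stated assumptions: a POSIX system, Python chosen for brevity, and `atomic_replace` is a hypothetical helper name, not something from this thread or from any particular application. The final directory fsync is an extra step some programs add so that the rename itself is durable; whether you want it depends on your requirements.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace `path` with `data` (bytes) via temp file + fsync + rename.

    Hypothetical helper for illustration; assumes POSIX rename() semantics,
    where renaming over an existing file replaces it atomically.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # within a single file system.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push the data to storage
        os.rename(tmp, path)      # switch the name over to the new contents
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Optional: fsync the directory so the rename itself reaches storage.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Note that `mkstemp` creates the file with restrictive permissions, so a real program would also need to decide what ownership and mode the replacement file should carry.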
We could also simply force a synchronous data writeback at rename
time, instead of merely starting writeback at the point of the rename.
In the case of a program which has already done an fsync(), the
synchronous data writeback would be a no-op, so that's good in terms
of not penalizing programs which do things right. But the problem
there is that there could be some renames where forcing data writeback
is not needed, and so we would be forcing the performance hit of the
"commit the source data" even when it might not be needed (or wanted)
by the user.

How often does it happen that someone does a rename on top of an
already-existing file, where the fsync() isn't wanted? Well, I can
think up scenarios, such as where an existing .iso image is corrupted
or needs to be updated, and so the user creates a new one and then
renames it on top of the old .iso image, but then gets surprised when
the rename ends up taking minutes to complete. Is that a common
occurrence? Probably not, but the case of the system crashing right
after the rename() is somewhat unusual as well.

Humans in general suck at reasoning about low-probability events;
that's why we are allowing low-paid TSA workers to grope
air-travellers to avoid terrorists blowing up planes midflight, while
not being up in arms over the number of deaths every year due to

For this reason, I'm cautious about going overboard at forcing commits
on renames; doing this has real performance implications, and it is a
computer science truism that optimizing for the uncommon/failure case
is a bad thing to do.

OK, what about simply deferring the commit of the rename until the
file writeback has naturally completed? The problem with that is
"entangled updates". Suppose there is another file which is written
to the same directory block as the one affected by the rename, and
*that* file is fsync()'ed? Keeping track of all of the data
dependencies is **hard**. See: http://lwn.net/Articles/339337/
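
To make the "entangled updates" problem concrete, here is a deliberately tiny toy model. It is an assumption-laden illustration, not how ext4 or any real file system tracks dependencies: `ToyFile` and `ToyDirBlock` are hypothetical names, and the only rule modelled is that a directory block is written out as a unit, so an fsync() that needs one entry in the block also commits every other dirty entry in it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    name: str
    data_on_disk: bool = False  # has the file's data reached storage yet?


@dataclass
class ToyDirBlock:
    # Files whose directory entries are dirty in this one block. The block
    # can only be written as a whole (toy assumption).
    dirty_entries: list = field(default_factory=list)

    def fsync(self, target, defer_rename_until_data_written):
        """Write the block to satisfy fsync(target).

        Returns (forced, exposed): names whose data had to be flushed first,
        and names whose entry was committed before their data was written.
        """
        forced = []
        for f in self.dirty_entries:
            must_flush = f is target or defer_rename_until_data_written
            if not f.data_on_disk and must_flush:
                f.data_on_disk = True
                forced.append(f.name)
        exposed = [f.name for f in self.dirty_entries if not f.data_on_disk]
        self.dirty_entries.clear()
        return forced, exposed
```

With a freshly renamed `big.iso` and a small `note.txt` sharing the block, calling `fsync` on `note.txt` with the deferral policy enabled forces `big.iso` out as well, so the unrelated fsync() pays for it. With the policy disabled, the rename is committed ahead of its data. Avoiding both outcomes is what requires the dependency tracking.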
> > But for me to list all possible approaches and tell you why each one
> > is not going to work? You'll have to pay me before I'm willing to
> > invest that kind of time.
> That's not what I asked.
Actually, it is, although maybe you didn't realize it. Look above,
and how I had to present multiple alternatives, and then shoot them
all down, one at a time. There are hundreds of solutions, all of them

Hence why *my* counter is --- submit patches. The mere act of
actually trying to code an alternative will allow you to determine why
your approach won't work, or failing that, others can take your patch,
apply it, and then demonstrate use cases where your idea completely
falls apart. But it means that you do most of the work, which is fair
since you're the one demanding the feature.

It doesn't scale for me to spend a huge amount of time composing
e-mails like this, which is why it's rare that I do that. You've
tricked me into it this time, which is time that I've lost and I can't
get back into doing useful things, like improving ext4.

Congratulations. It probably won't be happening again.