
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Goswin,

--On 29 May 2011 17:14:51 +0200 Goswin von Brederlow <goswin-v-b@...186...> wrote:

Point taken. However, it does mean we would in practice end up documenting
how the Linux block layer behaves, which is itself not static.

Which makes it even more important. We are implementing how the block
layer behaves now and not how it behaves next year. So if next year
things break down, you need some document telling you what behaviour was
implemented, as opposed to what Linux happens to do now. Otherwise you
can't say whether the problem is a bug in the implementation or an
incompatibility with newer Linux.

If the two become incompatible then a new protocol version needs to be
defined.

I think I am not expressing myself clearly. The new request and bio driver
interfaces are pretty new, and aren't particularly well documented. It's
not beyond imagination that the requirements on (a) users (by which I mean
filesystems) and (b) drivers will change. That change might be the
imposition of more restrictions, or a loosening of things. For instance,
the current requirement re overlapping writes is that filesystem users have
to "be careful" and not issue these, and drivers can handle them pretty
much as they please. The requirement on users might loosen (so the kernel
does the work), and the block layer might do the work, or alternatively the
work might be passed through to the drivers to do (in which case all the
in-kernel ones would be fixed up).

nbd is slightly unusual in that it is currently exporting the semantics of
"whatever the current kernel does today". Write reordering for overlapping
requests is a case in point. There is a very good chance one of the many
block layers we've had prior to the 2.6.3x one expected block drivers /not/
to reorder writes. That restriction has now been lifted. However, I bet if
you do reorder writes for overlapping requests, and your server is
connected to by an old kernel (I mean kernel, not client), bad things may
happen. That problem is there, and has always been there.

I am not sure the problem is therefore best solved by saying "the protocol
requires XYZ", particularly given XYZ is not actually documented either.
We'd do better to document the behaviour of current kernels, and to say
"this client does/does not permit write reordering of overlapping requests
by the server" in some form of negotiation. The tiny number of servers that
rely on write reordering for overlapping requests (I'd suggest the number
is probably zero) could then refuse to connect.
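
Purely by way of illustration (NBD_FLAG_NO_REORDER below is a made-up
handshake bit, not part of the current protocol), that negotiation could be
as simple as:

  #include <stdint.h>

  /* Hypothetical handshake bit, for illustration only. */
  #define NBD_FLAG_NO_REORDER (1 << 15)

  static int check_reorder_negotiation(uint32_t client_flags,
                                       int server_reorders_overlapping_writes)
  {
          /* A client that cannot tolerate reordering of overlapping
           * writes sets the (made-up) flag; a server that relies on
           * reordering refuses the connection at handshake time rather
           * than silently corrupting data later. */
          if ((client_flags & NBD_FLAG_NO_REORDER) &&
              server_reorders_overlapping_writes)
                  return -1;      /* refuse to connect */
          return 0;
  }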

The alternative is to put a stake in the ground with one particular kernel
revision, and say "these are the semantics required of an NBD device,
whatever changes are made to the Linux block layer". There are two issues
with that. Firstly, which stake in the ground do you pick? The current
kernel? The most popular kernel (the previous iteration of the block layer)?
Secondly, you are asking for all sorts of cruft to be imported into the
kernel to emulate old kernels' block layer interfaces. Given this arcana
is only important to 0.1% of nbd users (as most servers simply handle
requests in order, and don't even support FLUSH or FUA), that seems
a high-maintenance choice.

Exactly. The read reply for disk Y would be ready to send before the one
for disk X. My case would only arise if both disk X and disk Y are ready
to reply. Which one gets to reply first?

It doesn't matter. Either can.

FUA is normally handled simply by keeping the bit in the request
and ensuring it writes through.

If only there was a write_fua() system call.

Would it make sense (and result in correct behaviour) to open the
physical disk once normally and once with O_DSYNC, and do any write
carrying the FUA flag over the O_DSYNC fd? Would that perform better than
doing write()+fsync() on FUA?

Yes, absolutely. I asked Christoph H about this and he suggested this
is the best way to do it. When I removed the sync_file_range() call
I thought I left a comment saying this is what we should do; I just
didn't have time to do it. As Christoph points out, right now fsync()
is not really an overhead, because every FUA is next to a FLUSH (just
because that's how filesystems migrated from the previous barrier system).
However, he plans that this may not be the case in XFS in future, and
presumably other filesystems may diverge similarly.

I can't speak for Wouter but I'd welcome a patch that did this right.
Note you need to deal with /all/ the files that might be written to.
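
By way of a sketch (illustrative names, not the actual nbd-server code; a
real patch would need to cover every file an export can write to, and
handle short writes and errors properly):

  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  struct export_fds {
          int fd;        /* ordinary writes */
          int fd_dsync;  /* writes carrying FUA */
  };

  static int open_export(struct export_fds *e, const char *path)
  {
          /* Open the export twice: once normally, once with O_DSYNC. */
          e->fd = open(path, O_RDWR);
          e->fd_dsync = open(path, O_RDWR | O_DSYNC);
          return (e->fd < 0 || e->fd_dsync < 0) ? -1 : 0;
  }

  static ssize_t export_write(struct export_fds *e, const void *buf,
                              size_t len, off_t off, int fua)
  {
          /* Both descriptors refer to the same file, so the data stays
           * coherent; the O_DSYNC one just adds write-through semantics,
           * so a FUA write is on stable storage before it is ACKed. */
          return pwrite(fua ? e->fd_dsync : e->fd, buf, len, off);
  }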

No, CoW-based filing systems *do* support FUA, in that they send
them out. Go trace what (e.g.) btrfs does.

I'm saying an NBD server on a CoW-based filing system can't properly do
FUA. Since sync_file_range() will not do the right thing there at all,
there is no point in claiming to support FUA.

But sync_file_range is not currently used. It *always* does an fsync().

See the "#if 0" at line 1159 of nbd-server.c.

Let the client send FLUSH
requests instead. Same effect in the end.

You can't control what the client sends (save for disabling stuff
you don't want to receive). The block layer will ask for a FUA call,
and (if, e.g., Christoph does what he plans) you may well get a FUA
request at the block layer with no nearby flush. So the best thing
is to "do more than is required" (i.e. the fsync()) until one of us
has the time to do O_DSYNC properly. However, what I wanted to do
was to get FUA into the protocol, because nbd-server is not the
only possible server out there.

Well, FUA could (and will if I have a minute) be implemented using
a shadow file descriptor and O_DSYNC. I think there is a comment to that
effect.

So let's only advertise that we can do FUA when we actually do it better
than FLUSH, i.e. when you have a shadow file or other implementation for
it. Otherwise be truthful and say we only do FLUSH.

But it's up to you whether you advertise it. It only advertises it
if you put it in the config file! If you don't permit advertising
of FUA, then in the case of something (e.g. a new XFS) which sends
FUA without a surrounding flush, you risk losing client data. It's
better to support FUA by "doing too much" than to refuse to support it
at all. Of course it would be best to do it properly.
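
For reference, advertising is per-export in the config file; assuming the
per-export booleans are called "flush" and "fua" (check nbd-server(5) for
the exact names and syntax), it would look something like:

  [myexport]
      exportname = /dev/mapper/vg0-export
      flush = true
      fua = true

Leave "fua" out and the server never advertises FUA support to the client
at all, which is exactly the trade-off described above.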

That was one of my earlier questions. :) Or close to one. I haven't
patched my kernel yet to do FLUSH/FUA on NBD, but maybe you could verify
that this is actually the behaviour.

Note you don't need to patch the kernel per se. Just pull the nbd-kernel
repo from git.alex.org.uk and there is a standalone module you can
build and insmod.

Indeed. It needs O_SYNC to catch write errors on the badblock pass.

Does it actually do this?

Yes. This should easily happen with multiple LVs on an NBD. A flush of
one LV only has to drain the queue for that LV, and can then flush
irrespective of what the other LVs do. Are filesystems (ext4? btrfs?)
smart enough to track which requests they need to wait for before a flush
and which they don't? I guess that makes a huge difference for
fsync() on one file while another is being written to.

I don't know. I know that ext4 (for instance) only seems to keep one
flush open.

c) Requests should be ACKed as soon as possible to minimize the delay
   until a client can safely issue a FLUSH.

That's probably true performance wise as a general point, but there is
a complexity / safety / memory use tradeoff. If you ACK every request
as soon as it comes in, you will use a lot of memory.

How do you figure that? For me a write request (all others can be freed
once they send their reply) always uses the same amount of memory from
the time it gets read from the socket till the time it is written to
disk (cache). The memory needed doesn't change whether you ACK it once it
is read from the socket, when the write is issued, or when the write
returns.

If you ACK a write request before you've written it somewhere, you
need to keep it in memory so you can write it later. Imagine you
just get a continuous stream of writes. If you do what you say
(i.e. ACK them as soon as possible, i.e. before even writing them),
the client will send them to you faster than you can deal with them,
and you will end up eating up lots of memory buffering them. Essentially
you are doing a write-behind cache. You will want to give up buffering
them at some point and start blocking, because the advantage of having
(e.g.) 1000 writes buffered over 100 is minimal (probably).
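
By way of illustration only (the names and the cap are made up), the sort
of bound I have in mind is: ACK early while the un-written backlog stays
under some limit, then fall back to write-before-ACK.

  #include <stddef.h>

  /* Arbitrary example cap on ACKed-but-unwritten data: 16MB. */
  #define WRITE_BEHIND_CAP (16 * 1024 * 1024)

  struct wb_state {
          size_t buffered_bytes;  /* ACKed but not yet written to disk */
  };

  /* Returns non-zero if this write may be ACKed before it hits the disk;
   * the caller adds len to buffered_bytes when it ACKs early, and
   * subtracts it again once the data has actually been written out. */
  static int can_ack_early(const struct wb_state *s, size_t len)
  {
          return s->buffered_bytes + len <= WRITE_BEHIND_CAP;
  }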

--
Alex Bligh


