
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Goswin,

--On 29 May 2011 14:53:01 +0200 Goswin von Brederlow <goswin-v-b@...186...> wrote:

That really sucks documentation-wise, because then you have to start a
hunt for further documentation which probably doesn't even exist other
than the source.

It is OK to say we implement what the Linux block layer expects, but that
should be spelled out in the text, or at least a file should be named to
look at for the details. It should not be left this vague.

Point taken. However, it does mean we would end up documenting
in practice how the Linux block layer behaves, which itself is
not static.

True. And a read reply takes time (lots of data to send). In case there
are multiple replies pending it would make sense to order them so that
FUA/FLUSH replies get priority, I think. After that I think all read
replies should go out in order of their requests (oldest first) and write
replies last, the reason being that something will be waiting for the
read while the writes are likely cached. On the other hand, write replies
are tiny and sending them first gets them out of the way and clears up
dirty pages on the client side faster. That might be beneficial too.

What do you think?

There's no need to specify that in the protocol. It may be how you choose
to implement it in your server; but it might not be how I choose to
implement it in mine. A good example is a RAID0/JBOD server where you
might choose to split the incoming request queue by underlying physical
device (and split requests spanning multiple devices into multiple
requests). Each device's queue could be handled by a separate thread.
This is perfectly permissible, and there needs to be no ordering between
the queues.
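
To make that concrete, something along these lines (a rough sketch with
invented names, not code from any actual server) would be perfectly
permissible:

#include <stdint.h>

struct backend {
    int      ndevs;     /* number of underlying devices            */
    uint64_t dev_size;  /* size of each device in bytes (JBOD-ish) */
};

/* Hypothetical per-device queueing function; e.g. each device's queue
 * is drained by its own thread. */
void queue_for_device(int dev, uint64_t dev_offset, const void *buf,
                      uint32_t len);

/* Split one incoming request across the underlying devices.  Nothing
 * in the protocol requires replies from different queues to be ordered
 * with respect to each other. */
void split_request(struct backend *be, uint64_t offset, const void *buf,
                   uint32_t len)
{
    const char *p = buf;

    while (len > 0) {
        int      dev     = (int)(offset / be->dev_size);
        uint64_t dev_off = offset % be->dev_size;
        uint64_t room    = be->dev_size - dev_off;
        uint32_t chunk   = len < room ? len : (uint32_t)room;

        queue_for_device(dev, dev_off, p, chunk);

        offset += chunk;
        p      += chunk;
        len    -= chunk;
    }
}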

Obviously. That was purely an implementation question.

I think you also misunderstood me. I didn't mean that incoming requests
should be ordered in this way but that pending outgoing replies should
be.

I don't think I misunderstood. What I meant was that in such a JBOD
situation, one disk X might be replying more slowly than disk Y
(say because it happens to have a more seeky load). So a read
request to disk Y issued after a read request to disk X might
result in a disk Y reply coming before the reply for disk X.

But I just thought about something else you wrote that makes this a bad
idea. You said that a FLUSH only ensures completed requests have been
flushed.

yes

So if a FLUSH is ACKed before a WRITE, then the client should
assume that the WRITE wasn't yet flushed and issue another FLUSH.

If a client wants to flush a particular write, it should just not
issue the flush until the write is ACK'd. No need to issue two
flushes. This is the way the request system works, not my design!

To prevent
this, a FLUSH ACK should come after the ACK of any WRITE that it flushed
to disk.

No, it need only come after it has flushed any acknowledged writes.
I.e. it can send the flush ACK before the ACKs of any other writes (for
instance unacknowledged ones issued before the flush) that it happened
to flush to disk at the same time.

So there should be some limits on how far a FUA/FLUSH ACK can
skip ahead of other replies.

I don't see why it has to be any different to the current linux
request model. It's a bit odd in some ways, but it is a working system.

And this has /nothing/ to do with FUA. There are no ordering constraints
at all on FUA.

Maybe it is best to simply send out replies in the order they happen to
finish. Or send them in the order they came in (only those waiting to be
sent, no waiting).

I think you are overcomplicating it :-)

You may process and send replies in any order. However, to
process a flush, you need to ensure all completed writes have gone to
non-volatile storage before ACKing it. That could be as simple as
an fsync(), or (e.g.) flushing your own volatile cache to disk
and asking the component disks to flush their volatile caches.

FUA is normally handled simply by keeping the bit in the request
and ensuring it writes through.
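
To illustrate (a rough sketch only, with invented names rather than
nbd-server's actual structures): FLUSH becomes an fsync(), and FUA is
honoured by forcing that one write through before its reply goes out.

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

enum cmd { CMD_WRITE, CMD_FLUSH };

struct req {
    enum cmd type;
    int      fua;       /* force unit access bit from the request */
    uint64_t offset;
    void    *buf;
    size_t   len;
};

/* Hypothetical function that sends the NBD reply for a request. */
void send_reply(struct req *r, int error);

void handle(int fd, struct req *r)
{
    switch (r->type) {
    case CMD_WRITE:
        if (pwrite(fd, r->buf, r->len, (off_t)r->offset) < 0) {
            send_reply(r, -1);
            return;
        }
        /* FUA: this write must reach stable storage before it is
         * acknowledged.  fdatasync() syncs the whole file, which is
         * more than strictly required but always safe. */
        if (r->fua && fdatasync(fd) < 0) {
            send_reply(r, -1);
            return;
        }
        send_reply(r, 0);
        break;
    case CMD_FLUSH:
        /* Everything acknowledged so far must be stable before the
         * flush reply goes out; fsync() achieves that. */
        send_reply(r, fsync(fd) < 0 ? -1 : 0);
        break;
    }
}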

sync_file_range() should be extended to have a FLUSH option then. For a
simple case that would be the same as fsync(). On a striped RAID or
multiple-device LV it could be reduced to flushing only the required
physical devices and not all of them.

I agree it isn't particularly useful, as do certain people on
linux-kernel (see the author of the original text - Christoph Hellwig).
However, Linus himself put the syscall in. You'll need to debate
the point on linux-kernel.

Yes. But it's more than that. If you write to a CoW based filing system,
fsync (and even fdatasync) will ensure the CoW metadata is also flushed
to the device, whereas sync_file_range won't. Without the CoW metadata
being written, the data itself is not really written.

Which just means that a CoW based filing system or sparse files don't
support FUA.

No, CoW based filing systems *do* support FUA in that they send
them out. Go trace what (e.g.) btrfs does.

I think you are confusing block-layer semantics (REQ_FLUSH and
REQ_FUA) with VFS semantics. There is no VFS equivalent of either
REQ_FLUSH or REQ_FUA. fsync() on a file does roughly what REQ_FLUSH
does. Opening the file a second time with O_DATASYNC set and writing
the blocks through that descriptor does roughly what REQ_FUA does (an
fdatasync() does rather more). nbd-server currently does "more than it
needs" for REQ_FUA, but given that almost all REQ_FUA requests are
immediately followed by a REQ_FLUSH, and two fdatasync() calls in a row
are no more work than one, this doesn't matter.

The idea of a FUA is that it is cheaper than a FLUSH. But
if nbd-server does fsync() in both cases then it is pointless to
announce FUA support.

Well, FUA could (and will if I have a minute) be implemented using
a shadow file and O_DATASYNC. I think there is a comment to that
effect.

However, nbd-server is not the only server in existence.
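
For what it's worth, the shadow-file idea might look roughly like this
(a sketch with invented names, not what nbd-server currently does; note
the flag is spelled O_DSYNC in <fcntl.h>): open the backing file a
second time with O_DSYNC and route FUA writes through that descriptor,
so only the FUA write itself is forced through.

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

struct export {
    int fd;      /* ordinary descriptor for normal writes */
    int fd_fua;  /* O_DSYNC descriptor for FUA writes     */
};

int export_open(struct export *e, const char *path)
{
    e->fd = open(path, O_RDWR);
    if (e->fd < 0)
        return -1;
    e->fd_fua = open(path, O_RDWR | O_DSYNC);
    if (e->fd_fua < 0) {
        close(e->fd);
        return -1;
    }
    return 0;
}

ssize_t export_write(struct export *e, const void *buf, size_t len,
                     uint64_t off, int fua)
{
    /* A FUA write goes via the O_DSYNC descriptor, so the data reaches
     * stable storage before pwrite() returns; a plain write stays in
     * the page cache until the next flush. */
    return pwrite(fua ? e->fd_fua : e->fd, buf, len, (off_t)off);
}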

Why not return EIO on the next FLUSH? If I return success on the next
FLUSH that would make the client think the write has successfully
migrated to the physical medium, which would not be true.

Because
a) there may not be a next FLUSH at all

Then I will never know the write had an error. I only see that on
fsync().

As I said, I am not saying don't error the flush. I am saying don't
only error the flush.

Say you are running "mkfs -t ext4 -c -c /dev/nbd0". Now you hit one bad
block and the device turns itself into read-only mode. Not the behaviour
you want.

Well, assuming you want the mkfs to carry on, and it wants to know
where block errors are, it should be opening the disk with O_SYNC
(or O_DATASYNC these days), which will translate into REQ_FUA as I
understand it.

Consider a normal SATA disk with a write-behind cache (forget nbd
for a minute). From memory mkfs just does normal block writes. It
may do an fsync() on the block device which results in a sync
at the end. It has no way of knowing where bad blocks are anyway.
(In practice SATA devices do their own bad block management but
you probably know that).

What happens if the client mounts a filesystem with the sync option?
Does nbd then get every request with FUA set?

No. The client sends FUA only when there is FUA on the request. The sync
option is a server option, and the server merely syncs after every
request.

It is also a mount option. Mounting a filesystem with sync on a local
disk and on nbd should give the same behaviour.

A mount option is something different. That will (as I understand it)
cause the block layer to work synchronously, and you will get
FUA/FLUSH/whatever. No client negotiation (beyond advertising
support of these) is needed. That's just how a sync mount option
would work with any other block device.

Note nbd devices don't have to be mounted. What you are proposing
would affect raw I/O (not mounted I/O) to these block devices.

And what I am saying is that this is not current behaviour. Even with
today's nbd release and my patched kernel (neither of which you are
guaranteed to have) you will not get a single REQ_FLUSH before unmounting
an ext2 or (on Ubuntu and some other distros) ext3 filing system with
default options. With an older kernel or client you will never get a
REQ_FLUSH *ever*. So if you throw away data because it is not flushed
when you get an NBD_CMD_DISC you *will corrupt the filing system*. Do
not do this. You should treat NBD_CMD_DISC as containing an implicit
flush (by which I mean a buffer flush, not necessarily a write to disk),
which it always has done (it closes the files).

I'm not talking about throwing away any data. The data will be written
or the write requests wouldn't have been ACKed.

I mean "written to non-volatile storage". You can ACK data if it's
been written to (e.g.) a volatile cache. If you do, you must commit
that cache to non-volatile storage after NBD_CMD_DISC at some stage
rather than discard it.
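
In other words, a rough sketch of the disconnect path (invented function
name) would be:

#include <unistd.h>

void handle_disconnect(int fd)
{
    /* Acknowledged writes may still be sitting dirty in the page cache
     * (or your own volatile cache).  Commit them rather than discard
     * them: an old client may never have sent a single REQ_FLUSH. */
    if (fsync(fd) != 0) {
        /* a real server would at least log this */
    }
    close(fd);
}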

* NBD_CMD_FLUSH: Wait for all pending requests to finish, then flush the
  data (i.e. do not reply until sync_file_range() on the whole file,
  fdatasync(), or fsync() has returned).

You only need to wait until any writes that you have sent replies
for have been flushed to disk. It may be easier to do more than
that (which is fine). Whilst you do not *have* to flush commands
issued (but not yet completed) before the REQ_FLUSH, Jan Kara says
"*please* don't do that".

Urgs. Assume one doesn't flush commands that were issued but not yet
completed. How then should the client force those to disk? Sleep until
they happen to be ACKed and only then flush?

Easy. The client issues a REQ_FLUSH *after* anything that needs to
be flushed to disk has been ACK'd.

This would make ACKing writes only when they reach the physical medium
(using libaio, not fsync() on every write) a total no-go with a
file-backed device. Is that really how Linux currently works? If that's
true then I really need to switch to ACKing requests as soon as the
write is issued and not when it completes.

There's nothing to prevent you doing *more* than Linux requires. Linux
only issues a REQ_FLUSH after the write of the data it wants to go to disk
has been ACK'ed (so Jan Kara / Christoph H say, anyway). But yes,
if you want speed, you should consider ACK'ing before you have actually
done the write, which will mean that (in effect) you have become a
write-behind cache.
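
As a rough sketch of what that looks like (invented names throughout,
single-threaded, and ignoring overlapping writes, which a real cache
would have to handle): the server keeps a list of acknowledged but
unwritten blocks, and a flush must drain the list before it is itself
acknowledged.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct dirty {
    uint64_t      off;
    size_t        len;
    void         *data;
    struct dirty *next;
};

static struct dirty *dirty_head;

/* ACK the write immediately; the data only reaches the backing file
 * when the next flush drains the list. */
int write_early_ack(const void *buf, size_t len, uint64_t off)
{
    struct dirty *d = malloc(sizeof(*d));

    if (!d)
        return -1;
    d->data = malloc(len);
    if (!d->data) {
        free(d);
        return -1;
    }
    memcpy(d->data, buf, len);
    d->off  = off;
    d->len  = len;
    d->next = dirty_head;
    dirty_head = d;
    /* ...send the NBD reply here, before any I/O has happened... */
    return 0;
}

/* A flush must write out everything previously acknowledged and then
 * make it stable before the flush reply goes out. */
int flush_cache(int fd)
{
    while (dirty_head) {
        struct dirty *d = dirty_head;

        if (pwrite(fd, d->data, d->len, (off_t)d->off) < 0)
            return -1;          /* leave the rest on the list */
        dirty_head = d->next;
        free(d->data);
        free(d);
    }
    return fsync(fd);
}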

Then we need to spell out what that behaviour exactly is:

a) A FLUSH affects at least all completed requests; a client must wait
   for request completion before sending a FLUSH.

Yes. Except the client only needs to wait for completion of those
requests it wants to ensure are flushed (not every request); see the
sketch after this list.

b) A FLUSH might affect other requests. (Normally those issued but not
   yet completed before the flush is issued.)

Yes. You can always flush more than is required.

c) Requests should be ACKed as soon as possible to minimize the delay
   until a client can safely issue a FLUSH.

That's probably true performance-wise as a general point, but there is
a complexity / safety / memory use tradeoff. If you ACK every request
as soon as it comes in, you will use a lot of memory.
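
To put (a) in code form, a rough client-side sketch (all the types and
helpers here are invented stand-ins for the client's transport) looks
like this:

#include <stdint.h>

struct client;                        /* opaque, hypothetical */

struct write_req {
    uint64_t handle;                  /* plus offset, length, data... */
};

/* Hypothetical transport helpers. */
void send_write(struct client *c, const struct write_req *w);
void wait_for_reply(struct client *c, uint64_t handle);
void send_flush(struct client *c);
void wait_for_flush_reply(struct client *c);

void commit_writes(struct client *c, struct write_req *writes, int n)
{
    int i;

    for (i = 0; i < n; i++)
        send_write(c, &writes[i]);

    /* The flush is only guaranteed to cover writes whose replies have
     * been received, so wait for those replies before issuing it. */
    for (i = 0; i < n; i++)
        wait_for_reply(c, writes[i].handle);

    send_flush(c);
    wait_for_flush_reply(c);
}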

A request-driven driver simply errors the request, in which case the
error is passed up and errors the relevant bios (there may be more than
one). The errored block number is largely irrelevant as there might not
be one (REQ_FLUSH) or it might be outside the bio due to merging
(that's my understanding anyway).

How? I mean that just pushes the issue down a layer. The physical disk
gets a write request, dumps the data into its cache and ACKs the
write. The driver passes up the ACK to the bio and the bio
completes. Then some time later the driver gets a REQ_FLUSH and the disk
returns a write error when it finds out it can't actually write the
block.

Color me ignorant, but isn't that roughly how it will go with a disk
with write caching enabled?

I am not sure what the "How?" is in relation to. Remember a request-based
driver doesn't deal with bios; it deals with requests. There is not a 1:1
relationship, due to merging, and due to generation of additional requests
to do flushes, etc. I think the bit that is missing is "the elevator
algorithm". All I'm saying is that there isn't an "errorred block"
that is passed up the chain - there's just an error on a particular
request, which might be a 20MB request, formed by the merger of 10 bios.
There's no indication where the error occurred as far as I know (or
if there is it is lost between layers).

--
Alex Bligh


