
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Goswin,

--On 28 May 2011 16:37:12 +0200 Goswin von Brederlow <goswin-v-b@...186...> wrote:

My view is that this is derived from the linux request layer, in
which case (having asked much the same question on fsdevel
a couple of days ago) the answers appear to be as follows:

1) Order of replies

Currently nbd-server works all requests in order and replies in
order. Since every request/reply has a handle to uniquely pair them I
assume replying to requests out of order is allowed and will (most
likely) be handled correctly by existing clients.

Handles can be reused only once the command in question is completed.

You may process commands out of order, and reply out of order,
save that
a) all write commands *completed* before you process a REQ_FLUSH
  must be written to non-volatile storage prior to completing
  that REQ_FLUSH (though apparently you should, if possible, make
  this true for all write commands *received*, which is a stronger
  condition) [Ignore this if you don't set SEND_REQ_FLUSH]
b) a REQ_FUA flagged write must not complete until its payload
  is written to non-volatile storage [ignore this if you don't
  set SEND_REQ_FUA]
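
For reference, the wire structures that carry the handle look roughly
like this (field layout as in the kernel's include/linux/nbd.h; a sketch
for illustration, not lifted from nbd-server). The reply simply echoes
the 8-byte handle of whichever request it completes, which is what makes
out-of-order completion workable:

#include <stdint.h>

/* On-the-wire NBD request and reply (all fields big-endian).  The
 * 8-byte handle is opaque to the server: a reply completes whichever
 * outstanding request carried the same handle, so replies may go out
 * in any order. */
struct nbd_request {
    uint32_t magic;      /* 0x25609513 */
    uint32_t type;       /* NBD_CMD_READ, NBD_CMD_WRITE, ... */
    char     handle[8];  /* echoed verbatim in the reply */
    uint64_t from;       /* byte offset into the export */
    uint32_t len;        /* length of the read or write */
} __attribute__((packed));

struct nbd_reply {
    uint32_t magic;      /* 0x67446698 */
    uint32_t error;      /* 0 on success, errno-style otherwise */
    char     handle[8];  /* handle of the request being completed */
} __attribute__((packed));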

2) Overlapping requests

I assume that requests may overlap. For example a client may write a
block of data and read it again before the write was ACKed. This would
be unexpected behaviour from a proper client but not forbidden.

Correct

As such
the server has to internally ensure the proper order of overlapping
requests.

Slightly surprisingly, the fsdevel folks' answer to this is that you
can disorder both reads and writes and do what is natural, i.e. do
not maintain ordering. A file system which cares about the result
should not issue reads of blocks for which the writes have not
completed.

3) Timing of replies and behaviour details

Now this is the big one. When should the server reply to a request and
how should it behave in detail? Most important is the barrier question
on FUA/FLUSH.

* NBD_CMD_READ: reply when it has the data, no choice there

Technically you need not reply as soon as you have data, but you
can't reply before.

* NBD_CMD_WRITE with NBD_FLAG_SEND_FUA:
  + NBD_CMD_FLAG_FUA:
    reply when data is on physical medium (sync_file_range(),
    fdatasync() or fsync() returned)

(don't use sync_file_range() - see the comment in the source)

    Does this act as a barrier? Should the server stop processing
    further requests until the FUA has completed?

No, it should not act as a barrier. You may disorder requests across
a FUA write, and you do not need to stop processing requests.

See Documentation/block/writeback_cache_control.txt in the linux
kernel sources.
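
To make that concrete, here is a minimal sketch of completing a
FUA-flagged write, assuming a plain file-backed export (exportfd), that
the payload has already been read off the socket, and a hypothetical
send_reply() that emits an nbd_reply as sketched above:

#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sends an nbd_reply carrying this handle/error. */
extern int send_reply(int sock, const char handle[8], uint32_t error);

/* Complete an NBD_CMD_WRITE.  If the request carried NBD_CMD_FLAG_FUA,
 * the payload must be on stable storage before the reply goes out. */
static int complete_write(int exportfd, int sock, const char handle[8],
                          const void *buf, size_t len, off_t off, int fua)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(exportfd, (const char *)buf + done,
                           len - done, off + (off_t)done);
        if (n < 0)
            return send_reply(sock, handle, (uint32_t)errno);
        done += (size_t)n;
    }
    /* fdatasync() covers more than the written range, which is more
     * than FUA requires but never less; sync_file_range() is
     * deliberately avoided, per the comment in the source. */
    if (fua && fdatasync(exportfd) < 0)
        return send_reply(sock, handle, (uint32_t)errno);
    return send_reply(sock, handle, 0);
}

With fua false this is also timing (b) from the next question: reply as
soon as the write into the page cache has returned.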

  + not NBD_CMD_FLAG_FUA:
    a) reply when the data has been received
    b) reply when the data has been committed to cache (write() returned)
    c) reply when the data has been committed to physical medium

You may do any of those, provided you write the data "eventually"
(i.e. when you receive a REQ_FLUSH or a disconnect).

    For a+b how does one report write errors that only appear after
    the reply? Report them in the next FLUSH request?

You don't. To be safe, I'd error every write (i.e. turn the medium
read only).

    The current behaviour is b I think

Correct, but from the client's point of view this is indistinguishable
from (a).

    unless the server runs with the
    sync option and then it is c. Is option a valid? In COW mode there
    would be little sense in waiting for the write to complete since all
    data is lost on reconnect anyway and this might be a tick faster.

CoW+synchronous writes are in general not useful unless you have
something else examining the CoW file.

* NBD_CMD_WRITE without NBD_FLAG_SEND_FUA:
  I assume the behaviour is the same as write with NBD_CMD_FLAG_FUA not
  set in the above case.

The semantics are the same as if you had set NBD_FLAG_SEND_FUA; no
incoming requests will have the FUA bit set.

  [I wonder if there should be a NBD_FLAG_SYNC although that would be
  identical to the client setting NBD_CMD_FLAG_FUA on all writes.]

There's not much point the server asking the client to set a bit on
every request, because it can set it itself; it's /practically/ identical
to FUA on everything (I think there may be a difference in metadata,
i.e. syncing mtime).

  I'm considering having an --auto-sync mode that runs in sync mode
  when NBD_FLAG_SEND_FUA and NBD_FLAG_SEND_FLUSH are not used. This would
  be slow for old clients and fast for new ones but ensure data
  integrity for both. Maybe something nbd-server could use too.

Note that even if you set NBD_FLAG_SEND_{FUA,FLUSH} you won't actually
get FUA / FLUSH unless:

a) the client supports it (because the client has to do a SETFLAGS
  ioctl)
b) the kernel supports it
c) the filing system supports it.

Because of (c), you won't get any on ext2, or (depending on distribution)
ext3, unless you mount with -o barrier=1 (some distros do this by default).
ext4 will send them by default, as do btrfs and xfs.

Therefore, even if you know (a) and (b) are satisfied, you should
at least attempt to be safe even if you never receive a single flush.

* NBD_CMD_DISC: Wait for all pending requests to finish, close socket

You should reply to all pending requests prior to closing the socket
I believe, mostly as it's polite. I believe the current client doesn't
send a disconnect until all replies are in, and I also think the server
may behave a little badly here.

  Should this flush data before closing the socket? And if so what if
  there is an error on flush? I guess clients should send NBD_CMD_FLUSH
  prior to NBD_CMD_DISC if they care.

No, you should not rely on this happening. Even an umount of an ext2
volume will not send NBD_CMD_FLUSH, even where kernel, client, and
server all support it.
You don't need to write it then and there (in fact there is no 'then
and there' as an NBD_CMD_DISC has no reply), but you cannot guarantee
*at all* that you will have received any sort of flush under any
circumstances.
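
Putting the last two answers together, a disconnect path might look like
the sketch below. The pending-request queue and helpers are hypothetical,
and NBD_CMD_DISC itself never gets a reply; the fsync() is not something
a client may rely on, but it costs little and errs on the safe side:

#include <stdint.h>
#include <unistd.h>

/* Hypothetical bookkeeping for requests still being worked on. */
struct pending {
    struct pending *next;
    char handle[8];
    /* ... offset, length, payload, ... */
};
extern struct pending *pending_head;
extern uint32_t finish_request(struct pending *p);  /* do the I/O, 0 or errno */
extern int send_reply(int sock, const char handle[8], uint32_t error);

/* NBD_CMD_DISC: complete and reply to everything still outstanding,
 * then close the socket. */
static void handle_disconnect(int sock, int exportfd)
{
    struct pending *p;
    for (p = pending_head; p != NULL; p = p->next)
        send_reply(sock, p->handle, finish_request(p));
    fsync(exportfd);
    close(sock);
}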

  What if there are more requests after this while waiting for pending
  requests to finish? Should they be ignored or return an error?

I believe it is an, um, undocumented implicit assumption that no
commands are sent after NBD_CMD_DISC is sent. The current server
just closes the socket, which will probably result in an EPIPE
upstream if the FIN packet gets back before these other commands
are written.

* NBD_CMD_FLUSH: Wait for all pending requests to finish, flush data
  (sync_file_range() on the whole file, fdatasync() or fsync() returned).

You only need to wait until any writes that you have sent replies
for have been flushed to disk. It may be easier to flush more than
that (which is fine). Whilst you do not *have* to flush any commands
issued (but not completed) before the REQ_FLUSH, Jan Kara says "*please*
don't do that".

  I assume any delayed write errors should be returned here.

No. There is really no such thing as a "delayed write error". Consider
what happens with the current server. We write to the block cache.
That hits physical disk some time later. If the physical disk errors,
the normal fs response is for the disk to go read-only. You are right
that flush will then (probably) error, but so will every other write
command. Basically, you are stuffed if you get an error.

  Does this act as a barrier? Should the server stop processing
  further requests until the FLUSH has completed?

No, REQ_FLUSH is not (in general) a barrier. You can disorder writes
across it (i.e. you can start and complete writes issued after
the REQ_FLUSH before you have completed the REQ_FLUSH, *and* you
can avoid completing writes issued before the REQ_FLUSH when you
complete the REQ_FLUSH - save for Jan's imprecations above). There
are no longer barriers within the request system since they got
ripped out a while ago.

  When using write(), implementing this as a barrier is probably
  easiest. But using async writes (libaio or POSIX AIO) one would just
  insert an aio_fsync() into the queue and reply on completion.

There's no harm implementing it as a barrier, but it's more than
is necessary. Assuming your writes are all written to a file
when completed, all you need to do is fsync().
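
Concretely, with a single file-backed export where every completed write
went through pwrite() on exportfd, the flush handler is about this small
(again using the hypothetical send_reply() from the earlier sketch):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

extern int send_reply(int sock, const char handle[8], uint32_t error);

/* NBD_CMD_FLUSH: everything already acknowledged must be on stable
 * storage before the flush itself is acknowledged.  fsync() on the
 * backing file gives exactly that (and a little more, which is fine). */
static int handle_flush(int exportfd, int sock, const char handle[8])
{
    if (fsync(exportfd) < 0)
        return send_reply(sock, handle, (uint32_t)errno);
    return send_reply(sock, handle, 0);
}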

For more detail, read this thread:
 http://www.spinics.net/lists/linux-fsdevel/msg45584.html
as I appear to have asked the same questions as you (there was
a bit of off-list stuff too).

In this I am assuming that REQ_FLUSH and REQ_FUA semantics are defined
by the linux kernel semantics for them. I think I can safely assume
that as it was me who added them so I guess I get to decide :-)

--
Alex Bligh


