[Nbd] Question about the expected behaviour of nbd-server for async ops



Hi,

I'm using the NBD protocol to implement a distributed RAID-like device
in userspace. The central server combines local devices and NBD devices
into a RAID-like compound and exports the result via NBD to the rest of
the world. This means my central server acts as both nbd-server and
nbd-client. And since I'm implementing this from scratch, I'm designing
it to be asynchronous from the start. [Note: I'm talking as an NBD
client directly to other nbd-servers, not using a running NBD device
from the kernel.]

I've read doc/proto.txt, but it only explains the data structures, not
the required behaviour. So I want to ask some questions to clarify what
the correct behaviour should be for what I export as an NBD device to
the world, and what I can expect from the nbd-servers I connect to.

1) Order of replies

Currently nbd-server processes all requests in order and replies in
order. Since every request/reply carries a handle that uniquely pairs
them, I assume replying to requests out of order is allowed and will
(most likely) be handled correctly by existing clients.
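
To make the handle pairing concrete, here is a minimal client-side
sketch of accepting replies in any order. It assumes the simple reply
layout (32-bit magic, 32-bit error, 8-byte opaque handle, all
big-endian); PendingTable and its methods are hypothetical names of
mine, not part of nbd-client or nbd-server.

```python
import struct

NBD_REPLY_MAGIC = 0x67446698
REPLY_FMT = ">II8s"  # magic, error, handle (8 opaque bytes)


class PendingTable:
    """Hypothetical helper: tracks requests that still await a reply."""

    def __init__(self):
        self._pending = {}  # handle -> caller's request object

    def add(self, handle, request):
        # A handle must be unique among the requests currently in flight.
        if handle in self._pending:
            raise ValueError("handle already in flight")
        self._pending[handle] = request

    def complete(self, raw_reply):
        """Match a 16-byte reply header to its request, in any order."""
        magic, error, handle = struct.unpack(REPLY_FMT, raw_reply)
        if magic != NBD_REPLY_MAGIC:
            raise ValueError("bad reply magic")
        # KeyError here means the server answered an unknown handle.
        return self._pending.pop(handle), error

    def outstanding(self):
        return len(self._pending)
```

A client built like this does not care which reply arrives first; only
the handle decides which request is completed.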

2) Overlapping requests

I assume that requests may overlap. For example a client may write a
block of data and read it again before the write was ACKed. This would
be unexpected behaviour from a proper client but not forbidden. As such
the server has to internally ensure the proper order of overlapping
requests.
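
One way a server could enforce this ordering is to track the byte range
of every in-flight request and hold back a new request until the
conflicting ones have finished. The sketch below is only an
illustration under that assumption; InFlight and its method names are
hypothetical, and two overlapping reads are treated as not conflicting.

```python
def overlaps(a_off, a_len, b_off, b_len):
    """True if the half-open byte ranges [off, off+len) intersect."""
    return a_off < b_off + b_len and b_off < a_off + a_len


class InFlight:
    """Hypothetical tracker for requests that have not completed yet."""

    def __init__(self):
        self._ranges = {}  # handle -> (offset, length, is_write)

    def must_wait_for(self, offset, length, is_write):
        """Handles that a new request has to be ordered after."""
        return [
            h
            for h, (off, ln, wr) in self._ranges.items()
            # Only pairs involving at least one write need ordering.
            if (is_write or wr) and overlaps(offset, length, off, ln)
        ]

    def start(self, handle, offset, length, is_write):
        self._ranges[handle] = (offset, length, is_write)

    def finish(self, handle):
        del self._ranges[handle]
```

For the example above, a read of a block that is still being written
would get the write's handle back from must_wait_for() and would be
started only after finish() has been called for that write.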

3) Timing of replies and behaviour details

Now this is the big one. When should the server reply to a request and
how should it behave in detail? Most important is the barrier question
on FUA/FLUSH.

* NBD_CMD_READ: reply when it has the data, no choice there

* NBD_CMD_WRITE with NBD_FLAG_SEND_FUA:
  + NBD_CMD_FLAG_FUA:
    reply when data is on physical medium (sync_file_range(),
    fdatasync() or fsync() returned)

    Does this act as a barrier? Should the server stop processing
    further requests until the FUA has completed?

  + not NBD_CMD_FLAG_FUA:
    a) reply when the data has been received
    b) reply when the data has been committed to cache (write() returned)
    c) reply when the data has been committed to physical medium

    For a+b how does one report write errors that only appear after
    the reply? Report them in the next FLUSH request?

    I think the current behaviour is b, unless the server runs with the
    sync option, in which case it is c. Is option a valid? In COW mode
    there would be little sense in waiting for the write to complete,
    since all data is lost on reconnect anyway, and this might be a tick
    faster.

* NBD_CMD_WRITE without NBD_FLAG_SEND_FUA:
  I assume the behaviour is the same as write with NBD_CMD_FLAG_FUA not
  set in the above case.

  [I wonder if there should be a NBD_FLAG_SYNC although that would be
  identical to the client setting NBD_CMD_FLAG_FUA on all writes.]

  I'm considering having an --auto-sync mode that runs in sync mode
  when NBD_FLAG_SEND_FUA and NBD_FLAG_SEND_FLUSH are not used. This would
  be slow for old clients and fast for new ones but ensure data
  integrity for both. Maybe something nbd-server could use too.

* NBD_CMD_DISC: Wait for all pending requests to finish, close socket

  Should this flush data before closing the socket? And if so what if
  there is an error on flush? I guess clients should send NBD_CMD_FLUSH
  prior to NBD_CMD_DISC if they care.

  What if there are more requests after this while waiting for pending
  requests to finish? Should they be ignored or return an error?

* NBD_CMD_FLUSH: Wait for all pending requests to finish, flush data
  (sync_file_range() on the whole file, fdatasync() or fsync() returned).

  I assume any delayed write errors should be returned here.

  Does this act as a barrier? Should the server stop processing
  further requests until the FLUSH has completed?

  When using write(), implementing this as a barrier is probably
  easiest. But using async writes (libaio or POSIX AIO) one would just
  insert an aio_fsync() into the queue and reply on completion.
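
To illustrate the reply timing I have in mind for WRITE, FUA and FLUSH,
here is a small synchronous sketch. It is my own assumption of
reasonable behaviour, not what nbd-server does: Backend,
record_late_error() and the "first error wins" policy are hypothetical,
fsync() stands in for whichever sync call is used, and short writes are
ignored for brevity. NBD_CMD_FLAG_FUA is taken to be bit 16 of the
32-bit request type field.

```python
import os

NBD_CMD_FLAG_FUA = 1 << 16  # command flag in the upper half of the type field


class Backend:
    """Hypothetical export backend on top of a plain file descriptor."""

    def __init__(self, fd):
        self.fd = fd
        self.deferred_error = 0  # errno seen after a reply was already sent

    def record_late_error(self, err):
        # Keep only the first late error until the next FLUSH reports it.
        if self.deferred_error == 0:
            self.deferred_error = err

    def write(self, offset, data, cmd_type=0):
        """Return the errno to put in the WRITE reply (0 on success)."""
        try:
            os.pwrite(self.fd, data, offset)  # option b: data is in the cache
            if cmd_type & NBD_CMD_FLAG_FUA:
                os.fsync(self.fd)  # FUA: data must be stable before replying
        except OSError as e:
            return e.errno
        return 0

    def flush(self):
        """Return the errno to put in the FLUSH reply (0 on success)."""
        # Report, and then forget, any error from an already-ACKed write.
        err, self.deferred_error = self.deferred_error, 0
        try:
            os.fsync(self.fd)
        except OSError as e:
            err = err or e.errno
        return err
```

With option a, the code that performs the write after the reply has
gone out would call record_late_error() on failure, and the client
would see that error in the reply to its next NBD_CMD_FLUSH.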


Have I got anything wrong there? Comments?

MfG
        Goswin


