
[Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension



The existing transmission phase protocol is difficult to sniff,
because correct interpretation of the server stream requires
context from the client stream (or risks false positives if
data payloads happen to contain the protocol magic numbers).  It
also rules out efficient sparse reads, and makes it impossible to
report an error on a short read without also sending length bytes
of (bogus) data.

Remedy this by adding a new option request negotiation, which
affects the response of the NBD_CMD_READ command, and sets
forth rules for how future command responses must behave when
they carry a data payload.

In a few places, I list some options that have not yet been
decided during discussion; option #A[123] deals with whether
we want to allow/require structured replies even without
payloads, and option #B[12] deals with what change in
transmission flags we want the client to be able to observe
as option haggling progresses.

Signed-off-by: Eric Blake <eblake@...696...>
---
 doc/proto.md | 391 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 374 insertions(+), 17 deletions(-)
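
Reviewer aid, not part of the patch: a minimal Python sketch of the
context-free framing that the proposed structured reply header is meant
to allow.  The function name `split_structured_replies` is hypothetical,
and big-endian field encoding is an assumption carried over from the
rest of the NBD protocol rather than something this patch states.  The
sketch only handles a stream made up entirely of structured replies.

```python
import struct

NBD_STRUCTURED_REPLY_MAGIC = 0x668E33EF
NBD_REPLY_FLAG_DONE = 1 << 0

# Header layout from the proposal: magic (32 bits), flags (16), type (16),
# handle (64), payload length (32).  Big-endian is assumed; 20 bytes total.
HEADER = struct.Struct(">IHHQI")


def split_structured_replies(stream: bytes):
    """Yield (flags, type, handle, payload) for each complete message.

    Only the server stream is consulted; no knowledge of the client's
    pending requests is needed to find message boundaries.
    """
    pos = 0
    while pos + HEADER.size <= len(stream):
        magic, flags, rtype, handle, length = HEADER.unpack_from(stream, pos)
        if magic != NBD_STRUCTURED_REPLY_MAGIC:
            raise ValueError(f"unexpected magic {magic:#x} at byte {pos}")
        start = pos + HEADER.size
        end = start + length
        if end > len(stream):
            break  # incomplete trailing message; caller should read more
        yield flags, rtype, handle, stream[start:end]
        pos = end
```

Because the payload length is in every header, a payload that happens to
contain magic-number bytes is skipped over rather than misread as the
start of a new message.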

diff --git a/doc/proto.md b/doc/proto.md
index c1e05c5..cd59d81 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -183,19 +183,28 @@ required to.
 ### Transmission

 There are two message types in the transmission phase: the request,
-and the reply.  The phase consists of a series of transactions, where
-the client submits requests and the server sends corresponding
-replies, with a single reply message per request, and continues until
-either side closes the connection.
+and the simple reply.  The phase consists of a series of transactions,
+where the client submits requests and the server sends corresponding
+replies, with a single simple reply message per request, and continues
+until either side closes the connection.

 Replies need not be sent in the same order as requests (i.e., requests
 may be handled by the server asynchronously).  Clients SHOULD use a
 handle that is distinct from all other currently pending transactions,
 but MAY reuse handles that are no longer in flight; handles need not
-be consecutive.  In each reply, the server MUST use the same value for
-handle as was sent by the client in the corresponding request.  In
-this way, the client can correlate which request is receiving a
-response.
+be consecutive.  In each reply message, the server MUST use the same
+value for handle as was sent by the client in the corresponding
+request.  In this way, the client can correlate which request is
+receiving a response.
+
+Note that it is impossible to tell by reading just the server traffic
+whether a data field of a simple reply will be present; the simple
+reply is also problematic for error handling of the `NBD_CMD_READ`
+request.  Therefore, the experimental `STRUCTURED_REPLY` extension
+creates a context-free server stream by adding an additional
+structured reply type, and documents that it is possible to have
+multiple structured reply messages (called chunks) in response to a
+single request message; see below.

 #### Request message

@@ -209,12 +218,30 @@ C: 64 bits, offset (unsigned)
 C: 32 bits, length (unsigned)  
 C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  

-#### Reply message
+#### Simple reply message

-The server replies with:
+[option #A1 - only replies with payload are affected]
+The simple reply message MUST be sent by the server in response to a
+request that requires no data payload.  It MUST also be used for the
+`NBD_CMD_READ` command if the experimental `STRUCTURED_REPLY`
+extension was not negotiated.  The message looks as follows:

-S: 32 bits, 0x67446698, magic (`NBD_REPLY_MAGIC`)  
-S: 32 bits, error  
+[option #A2 - enabling structured replies MAY affect all other commands]
+The simple reply message MUST be sent by the server in response to all
+requests if the experimental `STRUCTURED_REPLY` extension was not
+negotiated.  It MAY also be used for requests that require no data
+payload, even when structured replies are in use.  The message looks
+as follows:
+
+[option #A3 - enabling structured replies MUST affect all commands]
+The simple reply message MUST be sent by the server in response to all
+requests if the experimental `STRUCTURED_REPLY` extension was not
+negotiated, and MUST NOT be sent otherwise.  The message looks as
+follows:
+
+[all options]
+S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`)  
+S: 32 bits, error (MAY be zero)  
 S: 64 bits, handle  
 S: (*length* bytes of data if the request is of type `NBD_CMD_READ`)  

@@ -261,6 +288,8 @@ immediately after the handshake flags field in oldstyle negotiation:
   schedule I/O accesses as for a rotational medium
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands
+- bit 6, `NBD_FLAG_SEND_DF`; defined by the `STRUCTURED_REPLY` extension;
+  see below.

 Clients SHOULD ignore unknown flags.

@@ -349,6 +378,10 @@ of the newstyle negotiation.

     Defined by the experimental `SELECT` extension; see below.

+- `NBD_OPT_STRUCTURED_REPLY` (8)
+
+    Defined by the experimental `STRUCTURED_REPLY` extension; see below.
+
 #### Option reply types

 These values are used in the "reply type" field, sent by the server
@@ -448,6 +481,8 @@ valid may depend on negotiation during the handshake phase.
   set to 1 if the client requires "Force Unit Access" mode of
   operation.  MUST NOT be set unless transmission flags included
   `NBD_FLAG_SEND_FUA`.
+- bit 1, `NBD_CMD_FLAG_DF`; defined by the experimental `STRUCTURED_REPLY`
+  extension; see below

 #### Request types

@@ -456,9 +491,9 @@ The following request types exist:
 * `NBD_CMD_READ` (0)

     A read request. Length and offset define the data to be read. The
-    server MUST reply with a reply header, followed immediately by len
-    bytes of data, read offset bytes into the file, unless an error
-    condition has occurred.
+    server MUST reply with a simple reply header, followed immediately
+    by len bytes of data, read from offset bytes into the file, unless
+    an error condition has occurred.

     If an error occurs, the server SHOULD set the appropriate error code
     in the error field. The server MUST then either close the
@@ -469,13 +504,18 @@ The following request types exist:
     signalling no error), the server MUST immediately close the
     connection; it MUST NOT send any further data to the client.

+    The experimental `STRUCTURED_REPLY` extension changes the response
+    from a simple reply to a structured reply, in part to allow recovery
+    after a partial read and more efficient reads of sparse files; see
+    below.
+
 * `NBD_CMD_WRITE` (1)

     A write request. Length and offset define the location and amount of
     data to be written. The client MUST follow the request header with
     *length* number of bytes to be written to the device.

-    The server MUST write the data to disk, and then send the reply
+    The server MUST write the data to disk, and then send the simple reply
     message. The server MAY send the reply message before the data has
     reached permanent storage.

@@ -500,7 +540,7 @@ The following request types exist:
 * `NBD_CMD_FLUSH` (3)

     A flush request; a write barrier. The server MUST NOT send a
-    successful reply header for this request before all write requests
+    successful simple reply header for this request before all write requests
     for which a reply has already been sent to the client have reached
     permanent storage (using fsync() or similar).

@@ -554,6 +594,8 @@ The following error values are defined:
 * `ENOMEM` (12), Cannot allocate memory.
 * `EINVAL` (22), Invalid argument.
 * `ENOSPC` (28), No space left on device.
+* `EOVERFLOW` (75), Value too large; MUST NOT be sent outside of the
+  experimental `STRUCTURED_REPLY` extension; see below.

 The server SHOULD return `ENOSPC` if it receives a write request
 including one or more sectors beyond the size of the device.  It SHOULD
@@ -654,6 +696,321 @@ option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.

+### `STRUCTURED_REPLY` extension
+
+Some of the major downsides of the default simple reply to
+`NBD_CMD_READ` are as follows.  First, it is not possible to support
+partial reads (the command must succeed or fail as a whole: either len
+bytes of data must be sent or the connection must be closed).  Second,
+there is no way to efficiently skip over portions of a sparse file that
+are known to contain all zeroes.  Finally, it is not possible to reliably
+decode the server traffic without also having context of what pending
+read requests were sent by the client.
+
+To remedy this, a `STRUCTURED_REPLY` extension is envisioned. This
+extension adds a new option request, a new transmission flag, a new
+reply type during the transmission phase, a new command flag, a new
+command error, and alters the reply to the `NBD_CMD_READ` request.
+
+* `NBD_OPT_STRUCTURED_REPLY`
+
+    The client wishes to use structured replies during the
+    transmission phase.  The option request has no additional data.
+
+    The server replies with the following:
+
+    - `NBD_REP_ACK`: Structured replies have been negotiated; the server
+      MUST set the `NBD_FLAG_SEND_DF` flag in all future transmission
+      flags, and MUST use structured replies to the `NBD_CMD_READ`
+      transmission request.  Further extensions that use structured
+      replies may now be negotiated.
+    - For backwards compatibility, clients should be prepared to also
+      handle `NBD_REP_ERR_UNSUP`; in this case, no structured replies
+      will be sent.
+
+    It is envisioned that future extensions will add other new
+    requests that also require a data payload in the reply.  Such
+    extensions MUST use a structured reply, and not a simple reply.  A
+    server that supports such extensions MUST NOT advertise those
+    extensions until the client negotiates structured replies; and a
+    client MUST NOT make use of those extensions without first
+    enabling the `NBD_OPT_STRUCTURED_REPLY` extension.
+
+* `NBD_FLAG_SEND_DF`
+
+    [option #B1 - transmission flags always mirror current state;
+    state change can be observed if negotiation happens after
+    NBD_OPT_LIST]
+    The server MUST set this transmission flag to 1 if structured
+    replies have been negotiated, and MUST NOT set this flag
+    otherwise; that way, the client MAY use this flag as a
+    reliable witness of whether to expect a simple reply or structured
+    reply to the `NBD_CMD_READ` transmission request.
+
+    [option #B2 - final transmission flags are accurate, but
+    intermediate transmission flags can anticipate negotiation; state
+    change can be observed if negotiation does not happen]
+    When responding to the `NBD_OPT_EXPORT_NAME` option request (or
+    the `NBD_OPT_SELECT` request of the experimental `SELECT`
+    extension), the server MUST set this transmission flag to 1 if
+    structured replies have been negotiated, and MUST NOT set this
+    flag otherwise; that way, the client MAY use the final
+    state of this flag as a reliable witness of whether to expect a
+    simple reply or structured reply to the `NBD_CMD_READ`
+    transmission request.  When responding to the `NBD_OPT_LIST`
+    option request, the server MAY set this transmission flag, even if
+    structured replies have not yet been negotiated.
+
+    [all options]
+    Additionally, clients MUST NOT set the `NBD_CMD_FLAG_DF` request
+    flag unless this transmission flag is set.
+
+* Transmission phase
+
+    The transmission phase includes a third message type: the
+    structured reply, to be used for commands where the response must
+    include a data payload.  The server MUST NOT send this reply type
+    unless the client has successfully negotiated structured replies
+    via `NBD_OPT_STRUCTURED_REPLY`.  Conversely, the server MUST NOT
+    use a simple reply for `NBD_CMD_READ` if structured replies are
+    negotiated.
+
+    [option #A1, but not #A2 or #A3]
+    The server MUST NOT use structured replies for requests that never
+    require a data payload in the response.
+
+    Unless explicitly documented for a given request, a structured
+    reply MUST occupy only one message (similar to a simple reply).
+    However, some requests document that a structured reply MAY occupy
+    multiple chunks; each chunk uses a structured reply message (all
+    with the same value for "handle"), and the `NBD_REPLY_FLAG_DONE`
+    reply flag is used to identify the final chunk.  Where multiple
+    chunks are permitted, the intermediate chunks MAY be reordered
+    within constraints documented by the request, and the chunks MAY
+    be interleaved with messages from other pending transactions; but
+    the final chunk MUST always end the reply.
+
+    A structured reply message looks as follows:
+
+    S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
+    S: 16 bits, flags  
+    S: 16 bits, type  
+    S: 64 bits, handle  
+    S: 32 bits, length of payload (unsigned)  
+    S: *length* bytes of payload data (if *length* is non-zero)
+
+    The use of *length* in the reply allows context-free division of
+    the overall server traffic into individual reply messages; the
+    *type* field describes how to further interpret the payload.
+
+  * Structured reply flags
+
+    This field of 16 bits is sent by the server as part of every
+    structured reply.
+
+    - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
+      more structured reply chunks will be sent for the same client
+      request, and MUST set this bit if this is the final reply.  This
+      flag MUST always be set in response to requests which are
+      documented as using a structured reply, but not documented as
+      permitting multiple chunks.
+
+    The server MUST NOT set any other flags without first negotiating
+    the extension with the client.  Clients that receive an
+    unrecognized flag SHOULD close the connection.
+
+  * Structured Reply types
+
+    These values are used in the "type" field of a structured reply.
+    Each type determines how to interpret the "length" bytes of
+    payload.  If the client receives an unknown or unexpected type, it
+    SHOULD close the connection.
+
+    - `NBD_REPLY_TYPE_NONE` (0)
+
+      *length* MUST be 0 (and the payload field omitted).  This type
+       MUST always be used with the `NBD_REPLY_FLAG_DONE` bit set
+       (that is, it is only useful as the final reply chunk).  If no
+       earlier error chunks were sent, then this type implies that the
+       overall client request is successful.
+
+      [option #A1]
+      Valid as a reply to `NBD_CMD_READ`.
+
+      [option #A2]
+      Valid as a reply to any request.
+
+    - `NBD_REPLY_TYPE_ERROR` (1)
+
+      This reply type represents an error chunk.  *length* MUST be
+      exactly 4.  The payload is structured as:
+
+      32 bits: error (MUST be nonzero)  
+
+      This reply represents that an error occurred, and the client MUST
+      NOT make any assumptions about partial success. This type SHOULD
+      NOT be used unless it is the final reply chunk (where the flag
+      `NBD_REPLY_FLAG_DONE` is set), or it is immediately followed
+      by a chunk with type `NBD_REPLY_TYPE_NONE`.
+
+      [option #A1]
+      Valid as a reply to `NBD_CMD_READ`.
+
+      [option #A2]
+      Valid as a reply to any request.
+
+    - `NBD_REPLY_TYPE_ERROR_OFFSET` (2)
+
+      This reply type represents an error chunk.  *length* MUST be
+      exactly 12.  The payload is structured as:
+
+      32 bits: error (MUST be nonzero)  
+      64 bits: offset (unsigned)  
+
+      In addition to declaring that an error occurred, this type
+      provides enough additional information to inform the client
+      about any partial success.  *offset* MUST lie within the bounds
+      of the original offset and length of the client's request.  If
+      *offset* also lies within the bounds of an earlier data chunk of
+      the same reply, then the client MAY assume that data within that
+      earlier chunk is valid (while the rest of that chunk MAY be
+      bogus).  Any later data chunks of the same reply MUST NOT
+      contain the offset of this chunk.
+
+      Valid as a reply to `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_OFFSET_DATA` (3)
+
+      This reply type represents a data chunk.  *length* MUST be at
+      least 9.  The payload is structured as:
+
+      64 bits: offset (unsigned)  
+      *length - 8* bytes: data  
+
+      This reply represents the contents of *length - 8* bytes of the
+      file, starting at *offset*.  The data MUST lie within the bounds
+      of the original offset and length of the client's request, and
+      MUST NOT overlap with any earlier data or error chunks of the
+      same reply.
+
+      Valid as a reply to `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_OFFSET_HOLE` (4)
+
+      This reply type represents a data chunk.  *length* MUST be
+      exactly 12.  The payload is structured as:
+
+      64 bits: offset (unsigned)  
+      32 bits: hole size (unsigned)  
+
+      This reply represents that *hole size* bytes of the file (which
+      MUST be non-zero), starting at *offset*, read as all zeroes.
+      The hole MUST lie within the bounds of the original offset and
+      length of the client's request, and MUST NOT overlap with any
+      earlier data or error chunks of the same reply.
+
+      Valid as a reply to `NBD_CMD_READ`.
+
+* `NBD_CMD_FLAG_DF`
+
+    The "don't fragment" bit, valid during `NBD_CMD_READ`.  SHOULD be
+    set to 1 if the client requires the server to send at most one
+    data chunk in reply.  MUST NOT be set unless the transmission
+    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
+    `EOVERFLOW` error chunk, if the request length is too large.
+
+* `EOVERFLOW`
+
+    The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
+    client has requested `NBD_CMD_FLAG_DF` for a length that is too
+    large to read without fragmentation.  The server MUST NOT return
+    this error if the read request did not exceed 65,536 bytes, and
+    SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.
+
+* `NBD_CMD_READ`
+
+    If structured replies were not negotiated, then a read request
+    MUST always be answered by a simple reply, as documented above
+    (using magic 0x67446698 `NBD_SIMPLE_REPLY_MAGIC`, and containing
+    length bytes of data according to the client's request, although
+    those bytes MAY be invalid if an error is returned, and the
+    connection MUST be closed if an error occurs after a header
+    claiming no error).
+
+    If structured replies are negotiated, then a read request MUST
+    result in a structured reply that MAY contain one or more chunks
+    (each using magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), with
+    the following additional constraints.
+
+    The server MAY split the reply into any number of data chunks
+    (reply types of `NBD_REPLY_TYPE_OFFSET_DATA` and
+    `NBD_REPLY_TYPE_OFFSET_HOLE`); each chunk MUST describe at least
+    one byte, although to minimize overhead, the server SHOULD use
+    chunks where lengths and offsets are an integer multiple of 512
+    bytes, where possible (the first and last chunk of an unaligned
+    read being the most obvious place for an exception).  The server
+    MUST NOT send data chunks that overlap each other or any earlier
+    error chunks, and MUST NOT send chunks that describe data outside
+    the offset and length of the request, but MAY send the chunks in
+    any order (the client MUST reassemble data chunks into the correct
+    order), and MAY send additional data chunks even after reporting
+    an error chunk.  Note that a request for more than 2^32 - 9 bytes
+    MUST be split into at least two chunks, so as not to overflow the
+    length field of a reply while still allowing space for the offset
+    of each chunk.  When no error is detected, the server MUST send
+    enough data chunks to cover the entire region described by the
+    offset and length of the client's request.
+
+    To minimize traffic, the server MAY set the `NBD_REPLY_FLAG_DONE`
+    on the final data chunk (in which case it MUST NOT send any
+    further non-data chunks), but MUST NOT do so if it would still be
+    possible to detect an error while transmitting the chunk.  If the
+    last data chunk is not the final reply, the server MUST send a
+    final chunk with type `NBD_REPLY_TYPE_NONE` (and the flag
+    `NBD_REPLY_FLAG_DONE` set) to indicate success, or send an error
+    chunk.
+
+    If an error is detected, the server MUST still complete the
+    transmission of any current chunk (it SHOULD use padding bytes of
+    zero for any remaining data portion of
+    `NBD_REPLY_TYPE_OFFSET_DATA`), but MAY omit further data chunks.
+    The server MUST include an error chunk as one of the subsequent
+    chunks, but MAY defer the error reporting behind other queued
+    chunks.  An error chunk of type `NBD_REPLY_TYPE_ERROR` implies
+    that the client MUST NOT make any assumptions about validity of
+    data chunks, and SHOULD either have `NBD_REPLY_FLAG_DONE` set as
+    the final chunk, or be immediately followed by a chunk of type
+    `NBD_REPLY_TYPE_NONE`.  On the other hand, an error chunk of type
+    `NBD_REPLY_TYPE_ERROR_OFFSET` gives fine-grained information about
+    which earlier data chunk(s) encountered a failure, and MAY also be
+    sent in lieu of a data chunk; as such, a server MAY still usefully
+    follow it with further data chunks or further error offsets.
+    Generally, a server SHOULD NOT mix errors with offsets with a
+    generic error.  As long as all errors are accompanied by offsets,
+    the client MAY assume that any data chunks with no subsequent
+    error are valid, that chunks with errors are valid up until the
+    reported offset, and portions of the read that do not have a
+    corresponding data chunk are not valid.  If the final data or
+    error chunk did not have the `NBD_REPLY_FLAG_DONE` bit set, then
+    the server MUST use a final `NBD_REPLY_TYPE_NONE` chunk to
+    complete the reply, but the client MUST NOT treat this type as
+    success if an earlier error chunk was sent.
+
+    A client MAY close the connection if it detects that the server
+    has sent invalid chunks (such as overlapping data, or not enough
+    data before claiming success).
+
+    In order to avoid the burden of reassembly, the client MAY set the
+    `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not
+    fragment the reply.  If this flag is set, the server MUST send at
+    most one data chunk, although it MAY still send multiple chunks
+    (the remaining chunks would be error chunks or a final type of
+    `NBD_REPLY_TYPE_NONE`).  A server MAY reject a client's request
+    with the error `EOVERFLOW` if the length is too large to send
+    without fragmentation, in which case it MUST NOT send a data
+    chunk; however, the server MUST NOT use this error if the client's
+    requested length does not exceed 65,536 bytes.
+
 ## About this file

 This file tries to document the NBD protocol as it is currently
-- 
2.5.5
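
Reviewer aid, not part of the patch: a minimal Python sketch of how a
client might reassemble the chunks of a structured `NBD_CMD_READ` reply
under the rules proposed above.  The names `reassemble_read` and
`ReadFailed` are hypothetical, big-endian payload encoding is assumed
from the rest of the NBD protocol, and the sketch treats any error chunk
as failure of the whole read instead of salvaging partial data.  It also
checks overlap between data chunks only, not against error offsets.

```python
import struct

NBD_REPLY_FLAG_DONE = 1 << 0
NBD_REPLY_TYPE_NONE = 0
NBD_REPLY_TYPE_ERROR = 1
NBD_REPLY_TYPE_ERROR_OFFSET = 2
NBD_REPLY_TYPE_OFFSET_DATA = 3
NBD_REPLY_TYPE_OFFSET_HOLE = 4


class ReadFailed(Exception):
    """The reply carried an error chunk or violated the chunk rules."""


def reassemble_read(req_offset, req_length, chunks):
    """Rebuild the read buffer from (flags, type, payload) chunks.

    `chunks` holds the messages for one handle, in arrival order.
    A payload shorter than its fixed fields raises struct.error.
    """
    buf = bytearray(req_length)
    covered = []  # (start, end) ranges seen so far, relative to req_offset
    errors = []
    done = False
    for flags, rtype, payload in chunks:
        if done:
            raise ReadFailed("chunk received after NBD_REPLY_FLAG_DONE")
        done = bool(flags & NBD_REPLY_FLAG_DONE)
        if rtype == NBD_REPLY_TYPE_NONE:
            if payload or not done:
                raise ReadFailed("malformed NBD_REPLY_TYPE_NONE chunk")
        elif rtype == NBD_REPLY_TYPE_ERROR:
            errors.append(struct.unpack(">I", payload)[0])
        elif rtype == NBD_REPLY_TYPE_ERROR_OFFSET:
            errors.append(struct.unpack(">IQ", payload)[0])
        elif rtype in (NBD_REPLY_TYPE_OFFSET_DATA, NBD_REPLY_TYPE_OFFSET_HOLE):
            if rtype == NBD_REPLY_TYPE_OFFSET_DATA:
                (offset,) = struct.unpack_from(">Q", payload)
                data = payload[8:]
            else:
                offset, hole = struct.unpack(">QI", payload)
                data = bytes(hole)  # a hole reads as all zeroes
            start = offset - req_offset
            end = start + len(data)
            if not data or start < 0 or end > req_length:
                raise ReadFailed("chunk is empty or outside the request")
            if any(start < e and s < end for s, e in covered):
                raise ReadFailed("overlapping chunks")
            covered.append((start, end))
            buf[start:end] = data
        else:
            raise ReadFailed(f"unknown reply type {rtype}")
    if not done:
        raise ReadFailed("reply not terminated")
    if errors:
        raise ReadFailed(f"server reported error {errors[0]}")
    # Chunks are in bounds and disjoint, so the sizes must sum to the request.
    if sum(e - s for s, e in covered) != req_length:
        raise ReadFailed("success claimed without covering the whole request")
    return bytes(buf)
```

Chunks may arrive in any order, so the buffer is filled by offset rather
than appended to; a client that sets `NBD_CMD_FLAG_DF` could skip the
overlap bookkeeping, since at most one data chunk is then expected.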



