
[Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension



The existing transmission phase protocol is difficult to sniff,
because correct interpretation of the server stream requires
context from the client stream (or risks false positives if
data payloads happen to contain the protocol magic numbers).  It
also rules out efficient sparse reads, as well as short reads
that report an error without also sending length bytes of
(bogus) data.

Remedy this by adding a new option request negotiation, which
affects the response of the NBD_CMD_READ command, and sets
forth rules for how future command responses must behave when
they carry a data payload.

This proposal does NOT permit structured replies to anything
other than NBD_CMD_READ, although a future proposal may wish
to make that valid (so that a server could be written that
only returns structured replies).

Signed-off-by: Eric Blake <eblake@...696...>
---
 doc/proto.md | 260 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 256 insertions(+), 4 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 3f9ee23..75b1534 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -211,6 +211,14 @@ handle as was sent by the client in the corresponding request.  In
 this way, the client can correlate which request is receiving a
 response.

+By default, there is exactly one reply message for each request
+(unless the connection is closed due to an error).  Note that it is
+impossible to tell by reading just the server traffic whether a data
+field will be present.  The experimental `Structured Read` extension
+adds an additional reply type, documents when there will be multiple
+replies to a single request, and creates a context-free server stream;
+see below.
+
 ## Values

 This section describes the value and meaning of constants (other than
@@ -255,6 +263,8 @@ immediately after the global flags field in oldstyle negotiation:
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands

+Clients SHOULD ignore unknown flags.
+
 ##### Client flags

 This field of 32 bits is sent after initial connection and after
@@ -338,6 +348,10 @@ of the newstyle negotiation.

     Defined by the experimental `SELECT` extension; see below.

+- `NBD_OPT_STRUCTURED_READ` (8)
+
+    Defined by the experimental `Structured Read` extension; see below.
+
 #### Option reply types

 These values are used in the "reply type" field, sent by the server
@@ -430,6 +444,8 @@ valid may depend on negotiation during the handshake phase.
   set to 1 if the client requires "Force Unit Access" mode of
   operation.  MUST NOT be set unless export flags included
   `NBD_FLAG_SEND_FUA`.
+- bit 1, `NBD_CMD_FLAG_DF`; defined by the experimental `Structured
+  Read` extension; see below

 #### Request types

@@ -451,6 +467,10 @@ The following request types exist:
     signalling no error), the server MUST immediately close the
     connection; it MUST NOT send any further data to the client.

+    The experimental `Structured Read` extension changes the set of
+    valid replies, in part to allow recovery after a partial read and
+    more efficient reads of sparse files; see below.
+
 * `NBD_CMD_WRITE` (1)

     A write request. Length and offset define the location and amount of
@@ -536,13 +556,16 @@ The following error values are defined:
 * `ENOMEM` (12), Cannot allocate memory.
 * `EINVAL` (22), Invalid argument.
 * `ENOSPC` (28), No space left on device.
+* `EOVERFLOW` (75), Value too large.

 The server SHOULD return `ENOSPC` if it receives a write request
 including one or more sectors beyond the size of the device.  It SHOULD
 return `EINVAL` if it receives a read or trim request including one or
 more sectors beyond the size of the device.  It also SHOULD map the
-`EDQUOT` and `EFBIG` errors to `ENOSPC`.  Finally, it SHOULD return
-`EPERM` if it receives a write or trim request on a read-only export.
+`EDQUOT` and `EFBIG` errors to `ENOSPC`.  It SHOULD return `EOVERFLOW`
+on a request to send structured read data without fragmentation but
+where the length is too large.  Finally, it SHOULD return `EPERM` if
+it receives a write or trim request on a read-only export.

 The server SHOULD return `EINVAL` if it receives an unknown command.

@@ -579,7 +602,7 @@ To remedy this, a `SELECT` extension is envisioned. This extension adds
 two option requests and one error reply type, and extends one existing
 option reply type.

-* `NBD_OPT_SELECT`
+* `NBD_OPT_SELECT` (6)

     The client wishes to select the export with the given name for use
     in the transmission phase, but does not yet want to move to the
@@ -613,7 +636,7 @@ option reply type.
       handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to
       using `NBD_OPT_EXPORT_NAME`.

-* `NBD_OPT_GO`
+* `NBD_OPT_GO` (7)

     The client wishes to terminate the negotiation phase and progress to
     the transmission phase. Possible replies from the server include:
@@ -635,6 +658,235 @@ option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.

+### `Structured Read` extension
+
+Some of the major downsides of the default reply to `NBD_CMD_READ`
+(without structured replies) are as follows.  First, it is not
+possible to support partial reads (the command must succeed or fail as
+a whole: either len bytes of data must be sent or the connection must
+be closed).  Second, there is no way to efficiently skip over portions
+of a sparse file that are known to contain all zeroes.  Finally, it is
+not possible to reliably decode the server traffic without also having
+context of what pending read requests were sent by the client.
+
+To remedy this, a `Structured Read` extension is envisioned. This
+extension adds a new option request, a new reply type during the
+transmission phase, and a new command flag, and alters the set of
+valid replies to an existing command.
+
+* `NBD_OPT_STRUCTURED_READ` (8)
+
+    The client wishes to use structured reads during the transmission
+    phase.  The option request has no additional data.
+
+    The server replies with one of the following:
+
+    - `NBD_REP_ACK`: Structured reads have been negotiated; the server
+      MUST use structured replies to `NBD_CMD_READ`
+    - `NBD_REP_ERR_UNSUP`: Structured reads are not available; the transmission
+      phase MUST remain the same as if the client had not attempted
+      `NBD_OPT_STRUCTURED_READ`
+
+* Transmission phase
+
+    The transmission phase includes a third message type: the
+    structured reply, to be used for commands where the response must
+    include a data payload.  The server MUST NOT send this reply type
+    unless the client has successfully negotiated an extension that
+    requires the use of a structured reply; this includes the
+    negotiation of Structured Reads via `NBD_OPT_STRUCTURED_READ`.
+
+    A structured reply looks as follows:
+
+    S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
+    S: 16 bits, flags  
+    S: 16 bits, type  
+    S: 64 bits, handle  
+    S: 32 bits, length of payload (unsigned)  
+    S: *length* bytes of payload data
+
+    The use of *length* in the reply allows context-free division of
+    the overall server traffic into individual reply messages; the
+    *type* field describes how to further interpret the payload.
+
+    While the server is permitted to send at most one normal reply (or
+    else close the connection), a command that uses structured replies
+    may document that the server is permitted to send multiple replies,
+    all sharing the same handle, by using the `NBD_REPLY_FLAG_DONE`
+    (bit 0) to delineate the final reply.  The server MAY interleave
+    intermediate replies to one structured command with replies
+    relating to a different handle.
+
+    A server MUST NOT send a data payload in a normal reply if
+    Structured Reads are negotiated.  It is envisioned that all future
+    extension commands that require a data payload in the response
+    will require independent option negotiation, and therefore, the
+    `NBD_CMD_READ` command is the only command that is allowed to use
+    the data payload of a normal reply, and only when Structured Reads
+    were not negotiated.  However, for ease of implementation, a
+    server MAY close the connection rather than entering transmission
+    phase if, at the end of option haggling, the client has negotiated
+    another command that requires a structured reply but did not also
+    negotiate Structured Reads.
+
+  * Structured Reply flags
+
+    This field of 16 bits is sent by the server as part of every
+    structured reply.
+
+    - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
+      more structured replies will be sent for the same client
+      request, and MUST set this bit if this is the final reply.
+      Commands which are documented as using structured replies, but
+      not documented as sending multiple replies, MUST always set this
+      bit.
+
+    The server MUST NOT set any other flags without first negotiating
+    the extension with the client.  Clients that receive an
+    unrecognized flag SHOULD close the connection.
+
+  * Structured Reply types
+
+    These values are used in the "type" field of a structured reply.
+    Each type determines how to interpret the "length" bytes of data.
+    If the client receives an unknown or unexpected type, it SHOULD
+    close the connection.
+
+    - `NBD_REPLY_TYPE_NONE` (0)
+
+      *length* MUST be 0 (and the payload field omitted).  This type
+      SHOULD be used only as the final reply (that is, when
+      `NBD_REPLY_FLAG_DONE` is set), and implies that the overall
+      client request was successfully completed.  Valid as a reply to
+      `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_OFFSET_DATA` (1)
+
+      *length* MUST be at least 9.  The payload is structured as:
+
+      64 bits: offset (unsigned)  
+      *length - 8* bytes: data
+
+      This reply represents the contents of *length - 8* bytes of the
+      file, starting at *offset*.  The data MUST lie within the
+      bounds of the original offset and length of the client's
+      request.  Valid as a reply to `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_OFFSET_HOLE` (2)
+
+      *length* MUST be exactly 12.  The payload is structured as:
+
+      64 bits: offset (unsigned)  
+      32 bits: hole size (unsigned)
+
+      This reply represents that *hole size* bytes of the file (which
+      MUST be non-zero), starting at *offset*, read as all zeroes.
+      The hole MUST lie within the bounds of the original offset and
+      length of the client's request.  Valid as a reply to
+      `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_ERROR` (3)
+
+      *length* MUST be exactly 4.  The payload is structured as:
+
+      32 bits: error
+
+      This reply represents that an error occurred, with no further
+      details as to the offset where the error occurred; and SHOULD be
+      used only as the final reply (that is, when
+      `NBD_REPLY_FLAG_DONE` is set).  Valid as a reply to
+      `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_ERROR_OFFSET` (4)
+
+      *length* MUST be exactly 12.  The payload is structured as:
+
+      32 bits: error  
+      64 bits: offset (unsigned)
+
+      This reply represents that an error occurred while handling the
+      given offset.  *error* MUST be nonzero, and *offset* must lie
+      within the bounds of the original offset and length of the
+      client's request.  Valid as a reply to `NBD_CMD_READ`.
+
+* `NBD_CMD_FLAG_DF` (bit 1)
+
+    Valid during `NBD_CMD_READ`.  SHOULD be set to 1 if the client
+    requires the server to send at most one data chunk in reply.  MUST
+    NOT be set unless the client negotiated Structured Reads with the
+    server.
+
+* `NBD_CMD_READ`
+
+    If `NBD_OPT_STRUCTURED_READ` was not negotiated, then a read
+    request MUST always be answered by a single non-structured
+    response, as documented above (using magic 0x67446698
+    `NBD_REPLY_MAGIC`, and containing length bytes of data according
+    to the client's request, although those bytes MAY be invalid if an
+    error is returned, and the connection MUST be closed if an error
+    occurs after a header claiming no error).
+
+    If `NBD_OPT_STRUCTURED_READ` is negotiated, then a read request
+    MUST result in one or more structured replies (each using magic
+    0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), with the following
+    additional constraints.
+
+    The server MAY split the reply into any number of data chunks,
+    using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or
+    `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least
+    one byte, although to minimize overhead, the server SHOULD use
+    chunks no smaller than 512 bytes where possible (the first and
+    last chunk of an unaligned read being the most obvious place for
+    an exception).  The server MUST NOT send chunks that overlap, and
+    MUST NOT send chunks that describe data outside the offset and
+    length of the request, but MAY send the chunks in any order (the
+    client is responsible for reassembling chunks into the correct
+    order).  Note that a request for more than 2^32 - 9 bytes MUST be
+    split into at least two chunks, so as not to overflow the length
+    field of a reply while still allowing space for the offset of each
+    chunk.
+
+    If no error is detected, then the server MUST send enough chunks
+    to cover the bytes requested.  The server MAY set the
+    `NBD_REPLY_FLAG_DONE` on the final data chunk, to minimize
+    traffic, but MUST NOT do so if it would still be possible to
+    detect an error while transmitting the chunk.  If the last data
+    chunk is not the final reply, the server MUST use
+    `NBD_REPLY_TYPE_NONE` as the final reply to indicate success.
+
+    If an error is detected, the server MUST send padding bytes to
+    complete the current chunk (if any), MUST report the error with a
+    reply type of either `NBD_REPLY_TYPE_ERROR` or
+    `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies
+    without sending the total number of bytes requested.  If one or
+    more offset errors are reported, the client MAY assume that all
+    data in chunks not including the offset, and all data within the
+    affected chunk but prior to the offset, is valid; the client MAY
+    NOT assume anything about data validity if no offset is provided.
+    The server MAY send additional chunks or offset error replies, if
+    `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply
+    also reports an error (that is, the final reply MUST NOT use
+    `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier
+    in constructing the final reply.  A server SHOULD NOT mix
+    `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies
+    to the same request.
+
+    A client MAY close the connection if it detects that the server
+    has sent invalid chunks (such as overlapping data, or not enough
+    data before claiming success).
+
+    In order to avoid the burden of reassembly, the client MAY set the
+    `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not
+    fragment the reply.  If this flag is set, the server MUST send at
+    most one `NBD_REPLY_TYPE_OFFSET_DATA` or
+    `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than
+    one reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`).  If
+    the client's length request is larger than 65,536 bytes (or if a
+    later extension adds a way to negotiate a larger maximum fragment
+    size), the server MAY reject the command with `EOVERFLOW`.  The
+    `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag
+    was not set, or if the requested length is no larger than 65,536.
+
 ## About this file

 This file tries to document the NBD protocol as it is currently
-- 
2.5.5
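
The structured reply layout proposed above (magic, flags, type, handle,
length, payload) is what makes the server stream divisible without any
knowledge of client requests.  A minimal sketch of that framing follows.
It assumes network byte order (as used elsewhere in NBD, but not restated
in this patch) and a stream that contains only structured replies; the
function name `split_structured_replies` is hypothetical.

```python
import struct

NBD_STRUCTURED_REPLY_MAGIC = 0x668E33EF
NBD_REPLY_FLAG_DONE = 1 << 0  # bit 0 of the 16-bit flags field

# magic (32 bits), flags (16), type (16), handle (64), payload length (32).
# Big-endian is an assumption carried over from the rest of the protocol.
HEADER = struct.Struct(">IHHQI")


def split_structured_replies(stream: bytes):
    """Yield (flags, type, handle, payload) for each reply found in stream.

    Only the length field is needed to find message boundaries; the
    payload is returned undecoded.
    """
    pos = 0
    while pos < len(stream):
        if len(stream) - pos < HEADER.size:
            raise ValueError("truncated structured reply header")
        magic, flags, rtype, handle, length = HEADER.unpack_from(stream, pos)
        if magic != NBD_STRUCTURED_REPLY_MAGIC:
            raise ValueError(f"unexpected magic {magic:#x}")
        pos += HEADER.size
        if len(stream) - pos < length:
            raise ValueError("truncated structured reply payload")
        yield flags, rtype, handle, stream[pos:pos + length]
        pos += length
```

A sniffer could feed captured server bytes to this generator and group the
results by handle until a reply with `NBD_REPLY_FLAG_DONE` set is seen.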

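The five reply types defined in the patch each fix the payload length and
layout.  The sketch below shows one way a client might decode and
validate a payload according to those rules.  The function name
`decode_payload` and the tuple return shapes are hypothetical, and
big-endian encoding is assumed rather than stated by the patch.

```python
import struct

NBD_REPLY_TYPE_NONE = 0
NBD_REPLY_TYPE_OFFSET_DATA = 1
NBD_REPLY_TYPE_OFFSET_HOLE = 2
NBD_REPLY_TYPE_ERROR = 3
NBD_REPLY_TYPE_ERROR_OFFSET = 4


def decode_payload(rtype: int, payload: bytes):
    """Decode one structured reply payload; raise ValueError if malformed."""
    n = len(payload)
    if rtype == NBD_REPLY_TYPE_NONE:
        if n != 0:
            raise ValueError("NONE: length MUST be 0")
        return ("none",)
    if rtype == NBD_REPLY_TYPE_OFFSET_DATA:
        if n < 9:
            raise ValueError("OFFSET_DATA: length MUST be at least 9")
        (offset,) = struct.unpack_from(">Q", payload)
        return ("data", offset, payload[8:])
    if rtype == NBD_REPLY_TYPE_OFFSET_HOLE:
        if n != 12:
            raise ValueError("OFFSET_HOLE: length MUST be exactly 12")
        offset, size = struct.unpack(">QI", payload)
        if size == 0:
            raise ValueError("OFFSET_HOLE: hole size MUST be non-zero")
        return ("hole", offset, size)
    if rtype == NBD_REPLY_TYPE_ERROR:
        if n != 4:
            raise ValueError("ERROR: length MUST be exactly 4")
        (error,) = struct.unpack(">I", payload)
        return ("error", error)
    if rtype == NBD_REPLY_TYPE_ERROR_OFFSET:
        if n != 12:
            raise ValueError("ERROR_OFFSET: length MUST be exactly 12")
        error, offset = struct.unpack(">IQ", payload)
        if error == 0:
            raise ValueError("ERROR_OFFSET: error MUST be nonzero")
        return ("error_offset", error, offset)
    # Unknown type: the proposal says the client SHOULD close the connection.
    raise ValueError(f"unknown structured reply type {rtype}")
```

Treating every violation as an exception keeps the caller's policy simple:
any `ValueError` can be mapped to closing the connection.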

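Because the server may send data and hole chunks in any order, the client
has to reassemble them and may want to check the constraints the patch
places on the server (no overlap, nothing outside the request, full
coverage before success is claimed).  The following is a rough sketch of
such a check for the no-error case.  The name `reassemble` and the
chunk tuple format are hypothetical, not part of the proposal.

```python
def reassemble(req_offset: int, req_length: int, chunks):
    """Rebuild the data for one successful read from its chunks.

    chunks is an iterable of ("data", offset, bytes) or
    ("hole", offset, size) tuples in arbitrary order.  Raises ValueError
    if the chunk set breaks the rules for NBD_CMD_READ replies.
    """
    buf = bytearray(req_length)  # holes read as zeroes, so start zero-filled
    spans = []
    req_end = req_offset + req_length
    for chunk in chunks:
        kind, offset = chunk[0], chunk[1]
        size = len(chunk[2]) if kind == "data" else chunk[2]
        if size == 0:
            raise ValueError("chunk MUST describe at least one byte")
        if offset < req_offset or offset + size > req_end:
            raise ValueError("chunk lies outside the request bounds")
        if kind == "data":
            start = offset - req_offset
            buf[start:start + size] = chunk[2]
        spans.append((offset, offset + size))

    # Walk the spans in offset order to detect overlaps and gaps.
    pos = req_offset
    for start, end in sorted(spans):
        if start < pos:
            raise ValueError("overlapping chunks")
        if start > pos:
            raise ValueError("not enough data before claiming success")
        pos = end
    if pos != req_end:
        raise ValueError("not enough data before claiming success")
    return bytes(buf)
```

For example, a 1024-byte read at offset 1024 answered by a hole at 1536
followed by 512 data bytes at 1024 yields the data bytes followed by 512
zero bytes.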

