
Re: [Nbd] [PATCH v4] doc: Propose STRUCTURED_REPLY extension



I'm feeling strongly enough that this is close to ready that I'm
probably going to merge it. It's still clearly marked as experimental,
so we can still easily fix things, but I doubt we're going to need much
more than tweaks.

(Obviously, this conflicts with the three other patches that I just
merged. Ah well.)

On Thu, Mar 31, 2016 at 11:29:48PM -0600, Eric Blake wrote:
> The existing transmission phase protocol is difficult to sniff,
> because correct interpretation of the server stream requires
> context from the client stream (or risks false positives if
> data payloads happen to contain the protocol magic numbers).  It
> also prohibits the ability to do efficient sparse reads, or to
> return a short read where an error is reported without also
> sending length bytes of (bogus) data.
> 
> Remedy this by adding a new option request negotiation, which
> affects the response of the NBD_CMD_READ command, and sets
> forth rules for how future command responses must behave when
> they carry a data payload.  It also makes it possible to return
> UTF-8 human-readable messages alongside an error code; and
> therefore structured replies are permitted for all commands.
> 
> Signed-off-by: Eric Blake <eblake@...696...>
> ---
> 
> In v4: rearrange paragraphs a bit, document 'structured reply chunk
> message' up front while still deferring details to the experimental
> section, lots of wording tweaks, remove options, add optional UTF-8
> error text to error chunks
> 
>  doc/proto.md | 370 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 352 insertions(+), 18 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 2600098..5f2fc02 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -184,17 +184,31 @@ required to.
> 
>  ### Transmission
> 
> -There are two message types in the transmission phase: the request,
> -and the reply.  The phase consists of a series of transactions, where
> -the client submits requests and the server sends corresponding
> -replies, with a single reply message per request, and continues until
> -either side closes the connection.
> +There are three message types in the transmission phase: the request,
> +the simple reply, and the experimental structured reply chunk.  The
> +transmission phase consists of a series of transactions, where the
> +client submits requests and the server sends corresponding replies
> +with either a single simple reply or a series of one or more
> +structured reply chunks per request.  The phase continues until either
> +side closes the connection.
> +
> +Note that without client negotiation, the server MUST use only simple
> +replies, and that it is impossible to tell by reading the server
> +traffic in isolation whether a data field will be present; the simple
> +reply is also problematic for error handling of the `NBD_CMD_READ`
> +request.  Therefore, the experimental `STRUCTURED_REPLY` extension
> +creates a context-free server stream by introducing the use of
> +structured reply chunks; see below.
> 
>  Replies need not be sent in the same order as requests (i.e., requests
> -may be handled by the server asynchronously).  Clients SHOULD use a
> -handle that is distinct from all other currently pending transactions,
> -but MAY reuse handles that are no longer in flight; handles need not
> -be consecutive.  In each reply, the server MUST use the same value for
> +may be handled by the server asynchronously), and structured reply
> +chunks from one request may be interleaved with reply messages from
> +other requests; however, there may be constraints that prevent
> +arbitrary reordering of structured reply chunks within a given reply.
> +Clients SHOULD use a handle that is distinct from all other currently
> +pending transactions, but MAY reuse handles that are no longer in
> +flight; handles need not be consecutive.  In each reply message
> +(whether simple or structured), the server MUST use the same value for
>  handle as was sent by the client in the corresponding request.  In
>  this way, the client can correlate which request is receiving a
>  response.
> @@ -211,15 +225,25 @@ C: 64 bits, offset (unsigned)
>  C: 32 bits, length (unsigned)  
>  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> 
> -#### Reply message
> +#### Simple reply message
> 
> -The server replies with:
> +The simple reply message MUST be sent by the server in response to all
> +requests if the experimental `STRUCTURED_REPLY` extension was not
> +negotiated.  If structured replies have been negotiated, a simple
> +reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> +but only if the reply has no data payload.  The message looks as
> +follows:
> 
> -S: 32 bits, 0x67446698, magic (`NBD_REPLY_MAGIC`)  
> -S: 32 bits, error  
> +S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`)  
> +S: 32 bits, error (MAY be zero)  
>  S: 64 bits, handle  
>  S: (*length* bytes of data if the request is of type `NBD_CMD_READ`)  
> 
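For illustration only, a minimal Python sketch of why the simple reply cannot be sized without client context. It assumes network byte order, as in the rest of the protocol; `simple_reply_size` and the `pending_reads` table are hypothetical names, not part of the patch.

```python
import struct

NBD_SIMPLE_REPLY_MAGIC = 0x67446698

# magic (32 bits), error (32 bits), handle (64 bits); big-endian is assumed
SIMPLE_REPLY = struct.Struct(">IIQ")


def simple_reply_size(buf, pending_reads):
    """Return the total size in bytes of the simple reply at the start of buf.

    pending_reads is a hypothetical table mapping each handle of an
    outstanding NBD_CMD_READ to its requested length.  The header alone
    does not say whether a data field follows, so the table is required.
    """
    magic, _error, handle = SIMPLE_REPLY.unpack_from(buf)
    if magic != NBD_SIMPLE_REPLY_MAGIC:
        raise ValueError("not a simple reply: magic 0x%08x" % magic)
    # A read reply carries *length* bytes of data even when error is set.
    return SIMPLE_REPLY.size + pending_reads.get(handle, 0)
```

The same 16 header bytes describe a 16-byte message or a 528-byte message depending only on what the client asked for, which is the sniffing problem the commit message describes.
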
> +#### Structured reply chunk message
> +
> +This reply type MUST NOT be used except as documented by the
> +experimental `STRUCTURED_REPLY` extension; see below.
> +
>  ## Values
> 
>  This section describes the value and meaning of constants (other than
> @@ -263,6 +287,8 @@ immediately after the handshake flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_DF`; defined by the `STRUCTURED_REPLY` extension;
> +  see below.
> 
>  Clients SHOULD ignore unknown flags.
> 
> @@ -351,6 +377,10 @@ of the newstyle negotiation.
> 
>      Defined by the experimental `SELECT` extension; see below.
> 
> +- `NBD_OPT_STRUCTURED_REPLY` (8)
> +
> +    Defined by the experimental `STRUCTURED_REPLY` extension; see below.
> +
>  #### Option reply types
> 
>  These values are used in the "reply type" field, sent by the server
> @@ -450,6 +480,8 @@ valid may depend on negotiation during the handshake phase.
>    set to 1 if the client requires "Force Unit Access" mode of
>    operation.  MUST NOT be set unless transmission flags included
>    `NBD_FLAG_SEND_FUA`.
> +- bit 1, `NBD_CMD_FLAG_DF`; defined by the experimental `STRUCTURED_REPLY`
> +  extension; see below
> 
>  #### Request types
> 
> @@ -458,19 +490,32 @@ The following request types exist:
>  * `NBD_CMD_READ` (0)
> 
>      A read request. Length and offset define the data to be read. The
> -    server MUST reply with a reply header, followed immediately by len
> -    bytes of data, read offset bytes into the file, unless an error
> +    server MUST reply with either a simple reply or a structured
> +    reply, according to whether the experimental `STRUCTURED_REPLY`
> +    extension was negotiated.
> +
> +    If structured replies were not negotiated, the server MUST reply
> +    with a simple reply header, followed immediately by len bytes of
> +    data, read from offset bytes into the file, unless an error
>      condition has occurred.
> 
> -    If an error occurs, the server SHOULD set the appropriate error code
> -    in the error field. The server MUST then either close the
> -    connection, or send *length* bytes of data (which MAY be invalid).
> +    If an error occurs, the server SHOULD set the appropriate error
> +    code in the error field. The server MUST then either close the
> +    connection, or send *length* bytes of data (these bytes MAY be
> +    invalid, in which case they SHOULD be zero); this is true even if
> +    the error is `EINVAL` for bad flags detected before even
> +    attempting to read.
> 
>      If an error occurs while reading after the server has already sent
>      out the reply header with an error field set to zero (i.e.,
>      signalling no error), the server MUST immediately close the
>      connection; it MUST NOT send any further data to the client.
> 
> +    The experimental `STRUCTURED_REPLY` extension changes the reply
> +    from a simple reply to a structured reply, in part to allow
> +    recovery after a partial read and more efficient reads of sparse
> +    files; see below.
> +
>  * `NBD_CMD_WRITE` (1)
> 
>      A write request. Length and offset define the location and amount of
> @@ -556,6 +601,8 @@ The following error values are defined:
>  * `ENOMEM` (12), Cannot allocate memory.
>  * `EINVAL` (22), Invalid argument.
>  * `ENOSPC` (28), No space left on device.
> +* `EOVERFLOW` (75), Value too large; SHOULD NOT be sent outside of the
> +  experimental `STRUCTURED_REPLY` extension; see below.
> 
>  The server SHOULD return `ENOSPC` if it receives a write request
>  including one or more sectors beyond the size of the device.  It SHOULD
> @@ -668,6 +715,293 @@ option reply type.
>        message if they do not also send it as a reply to the
>        `NBD_OPT_SELECT` message.
> 
> +### `STRUCTURED_REPLY` extension
> +
> +Some of the major downsides of the default simple reply to
> +`NBD_CMD_READ` are as follows.  First, it is not possible to support
> +partial reads or early errors (the command must succeed or fail as a
> +whole, and either len bytes of data must be sent or the connection
> +must be closed, even if the failure is `EINVAL` due to bad flags).
> +Second, there is no way to efficiently skip over portions of a sparse
> +file that are known to contain all zeroes.  Finally, it is not
> +possible to reliably decode the server traffic without also having
> +context of what pending read requests were sent by the client.
> +
> +To remedy this, a `STRUCTURED_REPLY` extension is envisioned. This
> +extension adds a new transmission phase message type, a new option
> +request, a new transmission flag, a new command flag, a new command
> +error, and alters the reply to the `NBD_CMD_READ` request.
> +
> +* Transmission phase
> +
> +    A structured reply in the transmission phase consists of one or
> +    more structured reply chunk messages.  The server MUST NOT send
> +    this reply type unless the client has successfully negotiated
> +    structured replies via `NBD_OPT_STRUCTURED_REPLY`.  Conversely, if
> +    structured replies are negotiated, the server MUST use a
> +    structured reply for any response with a payload, and MUST NOT use
> +    a simple reply for `NBD_CMD_READ` (even for the case of an early
> +    `EINVAL` due to bad flags).
> +
> +    A structured reply MAY occupy multiple structured chunk messages
> +    (all with the same value for "handle"), and the
> +    `NBD_REPLY_FLAG_DONE` reply flag is used to identify the final
> +    chunk.  Relative ordering of the chunks MAY be important;
> +    individual commands will document constraints on whether multiple
> +    chunks may be rearranged; however, it is always safe to interleave
> +    chunks of the reply to one request with messages related to other
> +    requests.  A server SHOULD try to minimize the number of chunks
> +    sent in a reply, but MUST NOT send the final chunk if there is
> +    still a possibility of detecting an error.  A structured reply is
> +    considered successful only if it did not contain any error chunks,
> +    although the client MAY be able to determine partial success based
> +    on the chunks received.
> +
> +    A structured reply chunk message looks as follows:
> +
> +    S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
> +    S: 16 bits, flags  
> +    S: 16 bits, type  
> +    S: 64 bits, handle  
> +    S: 32 bits, length of payload (unsigned)  
> +    S: *length* bytes of payload data (if *length* is non-zero)  
> +
> +    The use of *length* in the reply allows context-free division of
> +    the overall server traffic into individual reply messages; the
> +    *type* field describes how to further interpret the payload.
> +
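As a rough sketch of the context-free framing described above, the following Python splits a server byte stream into chunks using only the header fields. It assumes network byte order and a stream that contains nothing but structured reply chunks; `parse_chunk_header` and `split_chunks` are hypothetical helper names.

```python
import struct

NBD_STRUCTURED_REPLY_MAGIC = 0x668E33EF
NBD_REPLY_FLAG_DONE = 1 << 0

# magic (32), flags (16), type (16), handle (64), payload length (32)
CHUNK_HEADER = struct.Struct(">IHHQI")


def parse_chunk_header(buf, pos=0):
    """Return (flags, type, handle, length) for the chunk header at pos."""
    magic, flags, ctype, handle, length = CHUNK_HEADER.unpack_from(buf, pos)
    if magic != NBD_STRUCTURED_REPLY_MAGIC:
        raise ValueError("not a structured reply chunk: magic 0x%08x" % magic)
    return flags, ctype, handle, length


def split_chunks(stream):
    """Split bytes into a list of (flags, type, handle, payload) tuples."""
    chunks = []
    pos = 0
    while pos < len(stream):
        flags, ctype, handle, length = parse_chunk_header(stream, pos)
        start = pos + CHUNK_HEADER.size
        end = start + length
        if end > len(stream):
            raise ValueError("truncated chunk payload")
        chunks.append((flags, ctype, handle, stream[start:end]))
        pos = end
    return chunks
```

No table of pending requests is consulted here; the *length* field alone determines where each message ends.
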
> +  * Structured reply flags
> +
> +    This field of 16 bits is sent by the server as part of every
> +    structured reply.
> +
> +    - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
> +      more structured reply chunks will be sent for the same client
> +      request, and MUST set this bit if this is the final reply.  This
> +      bit MUST always be set for the `NBD_REPLY_TYPE_NONE` chunk,
> +      although any other chunk type can also be used as the final
> +      chunk.
> +
> +    The server MUST NOT set any other flags without first negotiating
> +    the extension with the client, unless the client can usefully
> +    react to the response without interpreting the flag (for instance
> +    if the flag is some form of hint).  Clients MUST ignore
> +    unrecognized flags.
> +
> +  * Structured Reply types
> +
> +    These values are used in the "type" field of a structured reply.
> +    Some chunk types can additionally be categorized by role, such as
> +    *error chunks* or *content chunks*.  Each type determines how to
> +    interpret the "length" bytes of payload.  If the client receives
> +    an unknown or unexpected type, it SHOULD close the connection.
> +
> +    - `NBD_REPLY_TYPE_NONE` (0)
> +
> +      *length* MUST be 0 (and the payload field omitted).  This chunk
> +      type MUST always be used with the `NBD_REPLY_FLAG_DONE` bit set
> +      (that is, it may appear at most once in a structured reply, and
> +      is only useful as the final reply chunk).  If no earlier error
> +      chunks were sent, then this type implies that the overall client
> +      request is successful.  Valid as a reply to any request.
> +
> +    - `NBD_REPLY_TYPE_ERROR` (1)
> +
> +      This chunk type is in the error chunk category.  *length* MUST
> +      be at least 4.  This chunk represents that an error occurred,
> +      and the client MAY NOT make any assumptions about partial
> +      success. This type SHOULD NOT be used more than once in a
> +      structured reply.  Valid as a reply to any request.
> +
> +      The payload is structured as:
> +
> +      32 bits: error (MUST be nonzero)  
> +      *length - 4* bytes: (optional UTF-8 encoded data suitable for
> +         direct display to a human being, not NUL terminated)  
> +
> +    - `NBD_REPLY_TYPE_ERROR_OFFSET` (2)
> +
> +      This chunk type is in the error chunk category.  *length* MUST
> +      be at least 12.  This reply represents that an error occurred at
> +      a given offset, which MUST lie within the original offset and
> +      length of the request; the client can use this offset to
> +      determine if the request had any partial success.  This chunk type
> +      MAY appear multiple times in a structured reply, although the
> +      same offset SHOULD NOT be repeated.  Likewise, if content chunks
> +      were sent earlier in the structured reply, the server SHOULD NOT
> +      send multiple distinct offsets that lie within the bounds of a
> +      single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> +      `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
> +
> +      The payload is structured as:
> +
> +      32 bits: error (MUST be nonzero)  
> +      64 bits: offset (unsigned)  
> +      *length - 12* bytes: (optional UTF-8 encoded data suitable for
> +         direct display to a human being, not NUL terminated)  
> +
> +    - `NBD_REPLY_TYPE_OFFSET_DATA` (3)
> +
> +      This chunk type is in the content chunk category.  *length* MUST
> +      be at least 9.  It represents the contents of *length - 8* bytes
> +      of the file, starting at *offset*.  The data MUST lie within the
> +      bounds of the original offset and length of the client's
> +      request, and MUST NOT overlap with the bounds of any earlier
> +      content chunk or error chunk in the same reply.  This chunk may
> +      be used more than once in a reply, unless the `NBD_CMD_FLAG_DF`
> +      flag was set.  Valid as a reply to `NBD_CMD_READ`.
> +
> +      The payload is structured as:
> +
> +      64 bits: offset (unsigned)  
> +      *length - 8* bytes: data  
> +
> +    - `NBD_REPLY_TYPE_OFFSET_HOLE` (4)
> +
> +      This chunk type is in the content chunk category.  *length* MUST
> +      be exactly 12.  It represents that the contents of *hole size*
> +      bytes starting at *offset* read as all zeroes.  The hole MUST
> +      lie within the bounds of the original offset and length of the
> +      client's request, and MUST NOT overlap with the bounds of any
> +      earlier content chunk or error chunk in the same reply.  This
> +      chunk may be used more than once in a reply, unless the
> +      `NBD_CMD_FLAG_DF` flag was set.  Valid as a reply to
> +      `NBD_CMD_READ`.
> +
> +      The payload is structured as:
> +
> +      64 bits: offset (unsigned)  
> +      32 bits: hole size (unsigned, MUST be nonzero)  
> +
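The payload layouts of the five chunk types could be decoded along these lines. This is a sketch under the assumption of network byte order; `decode_payload` and the tuple shapes it returns are made up for illustration, and invalid UTF-8 in the message text is replaced rather than rejected, which is a choice the patch does not mandate.

```python
import struct

NBD_REPLY_TYPE_NONE = 0
NBD_REPLY_TYPE_ERROR = 1
NBD_REPLY_TYPE_ERROR_OFFSET = 2
NBD_REPLY_TYPE_OFFSET_DATA = 3
NBD_REPLY_TYPE_OFFSET_HOLE = 4


def decode_payload(ctype, payload):
    """Decode one chunk payload into a tuple tagged with its kind."""
    if ctype == NBD_REPLY_TYPE_NONE:
        if payload:
            raise ValueError("NONE chunk must have length 0")
        return ("none",)
    if ctype == NBD_REPLY_TYPE_ERROR:
        if len(payload) < 4:
            raise ValueError("ERROR chunk shorter than 4 bytes")
        (error,) = struct.unpack_from(">I", payload)
        if error == 0:
            raise ValueError("error field must be nonzero")
        return ("error", error, payload[4:].decode("utf-8", "replace"))
    if ctype == NBD_REPLY_TYPE_ERROR_OFFSET:
        if len(payload) < 12:
            raise ValueError("ERROR_OFFSET chunk shorter than 12 bytes")
        error, offset = struct.unpack_from(">IQ", payload)
        if error == 0:
            raise ValueError("error field must be nonzero")
        return ("error_offset", error, offset,
                payload[12:].decode("utf-8", "replace"))
    if ctype == NBD_REPLY_TYPE_OFFSET_DATA:
        if len(payload) < 9:
            raise ValueError("OFFSET_DATA chunk shorter than 9 bytes")
        (offset,) = struct.unpack_from(">Q", payload)
        return ("data", offset, payload[8:])
    if ctype == NBD_REPLY_TYPE_OFFSET_HOLE:
        if len(payload) != 12:
            raise ValueError("OFFSET_HOLE chunk must be exactly 12 bytes")
        offset, size = struct.unpack(">QI", payload)
        if size == 0:
            raise ValueError("hole size must be nonzero")
        return ("hole", offset, size)
    # The patch says a client SHOULD close the connection in this case.
    raise ValueError("unknown chunk type %d" % ctype)
```
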
> +* `NBD_OPT_STRUCTURED_REPLY`
> +
> +    The client wishes to use structured replies during the
> +    transmission phase.  The option request has no additional data.
> +
> +    The server replies with the following:
> +
> +    - `NBD_REP_ACK`: Structured replies have been negotiated; the
> +      server MUST use structured replies to the `NBD_CMD_READ`
> +      transmission request.  Other extensions that require structured
> +      replies may now be negotiated.
> +    - For backwards compatibility, clients should be prepared to also
> +      handle `NBD_REP_ERR_UNSUP`; in this case, no structured replies
> +      will be sent.
> +
> +    It is envisioned that future extensions will add other new
> +    requests that may require a data payload in the reply.  A server
> +    that supports such extensions SHOULD NOT advertise those
> +    extensions until the client negotiates structured replies; and a
> +    client MUST NOT make use of those extensions without first
> +    enabling the `NBD_OPT_STRUCTURED_REPLY` extension.
> +
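A client-side sketch of the negotiation, for illustration. The option number 8 comes from this patch; the `IHAVEOPT` option magic, the request layout (magic, option, length) and the values of `NBD_REP_ACK` and `NBD_REP_ERR_UNSUP` are taken from the existing newstyle negotiation text in proto.md rather than from this patch, so treat them as assumptions here. The function names are hypothetical.

```python
import struct

# Assumed from the existing newstyle negotiation section of proto.md.
NBD_OPT_MAGIC = int.from_bytes(b"IHAVEOPT", "big")
NBD_REP_ACK = 1
NBD_REP_ERR_UNSUP = (1 << 31) + 1

# Defined by this patch.
NBD_OPT_STRUCTURED_REPLY = 8


def build_structured_reply_option():
    """Build the option request; it carries no additional data (length 0)."""
    return struct.pack(">QII", NBD_OPT_MAGIC, NBD_OPT_STRUCTURED_REPLY, 0)


def structured_replies_enabled(reply_type):
    """Return True only if the server acknowledged the option.

    NBD_REP_ERR_UNSUP (an older server) and any other reply type leave
    the connection using simple replies only.
    """
    return reply_type == NBD_REP_ACK
```
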
> +* `NBD_FLAG_SEND_DF`
> +
> +    The server MUST set this transmission flag to 1 if the
> +    `NBD_CMD_READ` request supports the `NBD_CMD_FLAG_DF` flag, and
> +    MUST leave this flag clear if structured replies have not been
> +    negotiated.  Clients MUST NOT rely on the state of this flag prior
> +    to the final flags value reported by `NBD_OPT_EXPORT_NAME` or
> +    experimental `NBD_OPT_GO`.  Additionally, clients MUST NOT set the
> +    `NBD_CMD_FLAG_DF` request flag unless this transmission flag is
> +    set.
> +
> +* `NBD_CMD_FLAG_DF`
> +
> +    The "don't fragment" flag, valid during `NBD_CMD_READ`.  SHOULD be
> +    set to 1 if the client requires the server to send at most one
> +    content chunk in reply.  MUST NOT be set unless the transmission
> +    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
> +    `EOVERFLOW` error chunk, if the request length is too large.
> +
> +* `EOVERFLOW`
> +
> +    The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
> +    client has requested `NBD_CMD_FLAG_DF` for a length that is too
> +    large to read without fragmentation.  The server MUST NOT return
> +    this error if the read request did not exceed 65,536 bytes, and
> +    SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.
> +
> +* `NBD_CMD_READ`
> +
> +    If structured replies were not negotiated, then a read request
> +    MUST always be answered by a simple reply, as documented above
> +    (using magic 0x67446698 `NBD_SIMPLE_REPLY_MAGIC`, and containing
> +    length bytes of data according to the client's request, although
> +    those bytes MAY be invalid if an error is returned, and the
> +    connection MUST be closed if an error occurs after a header
> +    claiming no error).
> +
> +    If structured replies are negotiated, then a read request MUST
> +    result in a structured reply with one or more chunks (each using
> +    magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), where the final
> +    chunk has the flag `NBD_REPLY_FLAG_DONE`, and with the following
> +    additional constraints.
> +
> +    The server MAY split the reply into any number of content chunks;
> +    each chunk MUST describe at least one byte, although to minimize
> +    overhead, the server SHOULD use chunks with lengths and offsets as
> +    an integer multiple of 512 bytes, where possible (the first and
> +    last chunk of an unaligned read being the most obvious places for
> +    an exception).  The server MUST NOT send content chunks that overlap
> +    with any earlier content or error chunk, and MUST NOT send chunks
> +    that describe data outside the offset and length of the request,
> +    but MAY send the chunks in any order (the client MUST reassemble
> +    content chunks into the correct order), and MAY send additional
> +    data chunks even after reporting an error chunk.  Note that a
> +    request for more than 2^32 - 8 bytes MUST be split into at least
> +    two chunks, so as not to overflow the length field of a reply
> +    while still allowing space for the offset of each chunk.  When no
> +    error is detected, the server MUST send enough data chunks to
> +    cover the entire region described by the offset and length of the
> +    client's request.
> +
> +    To minimize traffic, the server MAY use a content or error chunk
> +    as the final chunk by setting the `NBD_REPLY_FLAG_DONE` flag, but
> +    MUST NOT do so for a content chunk if it would still be possible
> +    to detect an error while transmitting the chunk.  The
> +    `NBD_REPLY_TYPE_NONE` chunk is always acceptable as the final
> +    chunk.
> +
> +    If an error is detected, the server MUST still complete the
> +    transmission of any current chunk (it SHOULD use padding bytes of
> +    zero for any remaining data portion of a chunk with type
> +    `NBD_REPLY_TYPE_OFFSET_DATA`), but MAY omit further content
> +    chunks.  The server MUST include an error chunk as one of the
> +    subsequent chunks, but MAY defer the error reporting behind other
> +    queued chunks.  An error chunk of type `NBD_REPLY_TYPE_ERROR`
> +    implies that the client MAY NOT make any assumptions about
> +    validity of data chunks, and if used, SHOULD be the only error
> +    chunk in the reply.  On the other hand, an error chunk of type
> +    `NBD_REPLY_TYPE_ERROR_OFFSET` gives fine-grained information about
> +    which earlier data chunk(s) encountered a failure, and MAY also be
> +    sent in lieu of a data chunk; as such, a server MAY still usefully
> +    follow it with further non-overlapping content chunks or with
> +    error offsets for other content chunks.  Generally, a server
> +    SHOULD NOT mix errors with offsets with a generic error.  As long
> +    as all errors are accompanied by offsets, the client MAY assume
> +    that any data chunks with no subsequent error offset are valid,
> +    that chunks with an overlapping error offset are valid up
> +    until the reported offset, and that portions of the read that do
> +    not have a corresponding content chunk are not valid.
> +
> +    A client MAY close the connection if it detects that the server
> +    has sent invalid chunks (such as overlapping data, or not enough
> +    data before claiming success).
> +
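The reassembly burden on the client might look roughly like this. The sketch covers only the error-free case and assumes the chunks have already been decoded into `("data", offset, bytes)` or `("hole", offset, size)` tuples; that tuple shape and the name `reassemble_read` are hypothetical.

```python
def reassemble_read(req_offset, req_length, chunks):
    """Rebuild the read buffer from content chunks received in any order.

    Raises ValueError for the invalid-chunk cases the text mentions:
    data outside the request, overlapping chunks, or incomplete coverage.
    """
    buf = bytearray(req_length)  # zero-filled, so holes need no copy
    covered = []                 # (start, end) ranges seen so far
    for kind, offset, body in chunks:
        size = len(body) if kind == "data" else body
        start, end = offset, offset + size
        if size == 0:
            raise ValueError("content chunk describes no bytes")
        if start < req_offset or end > req_offset + req_length:
            raise ValueError("content chunk outside the request bounds")
        if any(start < e and s < end for s, e in covered):
            raise ValueError("content chunk overlaps an earlier chunk")
        covered.append((start, end))
        if kind == "data":
            buf[start - req_offset:end - req_offset] = body
    # Ranges are in bounds and disjoint, so equal totals mean full coverage.
    if sum(e - s for s, e in covered) != req_length:
        raise ValueError("not enough data before claiming success")
    return bytes(buf)
```

For example, a hole chunk for the second half of a request followed by a data chunk for the first half yields the data bytes followed by zeroes, regardless of arrival order.
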
> +    In order to avoid the burden of reassembly, the client MAY set the
> +    `NBD_CMD_FLAG_DF` flag ("don't fragment").  If this flag is set,
> +    the server MUST send at most one content chunk, although it MAY
> +    still send multiple chunks (the remaining chunks would be error
> +    chunks or a final type of `NBD_REPLY_TYPE_NONE`).  If the area
> +    being read contains both data and a hole, the server MUST use
> +    `NBD_REPLY_TYPE_OFFSET_DATA` with the zeroes explicitly present.
> +    A server MAY reject a client's request with the error `EOVERFLOW`
> +    if the length is too large to send without fragmentation, in which
> +    case it MUST NOT send a content chunk; however, the server MUST
> +    NOT use this error if the client's requested length does not
> +    exceed 65,536 bytes.
> +
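On the server side, the `NBD_CMD_FLAG_DF` and `EOVERFLOW` rules could be checked along these lines. This is only a sketch: `check_df_read` and its `max_unfragmented` parameter (a server-chosen limit) are hypothetical, and answering a flag that was never advertised with `EINVAL` is an assumption based on the "bad flags" wording above.

```python
NBD_FLAG_SEND_DF = 1 << 6   # transmission flag, bit 6
NBD_CMD_FLAG_DF = 1 << 1    # command flag, bit 1

EINVAL = 22
EOVERFLOW = 75

# EOVERFLOW is forbidden for reads that do not exceed this many bytes.
DF_GUARANTEED_LENGTH = 65536


def check_df_read(cmd_flags, length, transmission_flags, max_unfragmented):
    """Return 0 if the read may proceed, else the error code to report."""
    if not cmd_flags & NBD_CMD_FLAG_DF:
        # Without DF the server is free to fragment, so no EOVERFLOW.
        return 0
    if not transmission_flags & NBD_FLAG_SEND_DF:
        # Client used a flag the server never advertised (assumed EINVAL).
        return EINVAL
    # A configured limit below 65,536 bytes is raised to the guaranteed minimum.
    limit = max(max_unfragmented, DF_GUARANTEED_LENGTH)
    if length > limit:
        return EOVERFLOW
    return 0
```
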
>  ## About this file
> 
>  This file tries to document the NBD protocol as it is currently
> -- 
> 2.5.5
> 
> 
> 

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


