Re: [Nbd] [PATCH v4] doc: Propose STRUCTURED_REPLY extension
- To: Eric Blake <eblake@...696...>
- Cc: nbd-general@lists.sourceforge.net
- Subject: Re: [Nbd] [PATCH v4] doc: Propose STRUCTURED_REPLY extension
- From: Wouter Verhelst <w@...112...>
- Date: Fri, 1 Apr 2016 10:41:37 +0200
- Message-id: <20160401084137.GG25514@...3...>
- In-reply-to: <1459488588-11175-1-git-send-email-eblake@...696...>
- References: <1459488588-11175-1-git-send-email-eblake@...696...>
I'm feeling strongly enough that this is close to ready that I'm
probably going to merge it. It's still clearly marked as experimental,
so we can still easily fix things, but I doubt we're going to need much
more than tweaks.
(Obviously, this conflicts with the three other patches that I just
merged. Ah well.)
On Thu, Mar 31, 2016 at 11:29:48PM -0600, Eric Blake wrote:
> The existing transmission phase protocol is difficult to sniff,
> because correct interpretation of the server stream requires
> context from the client stream (or risks false positives if
> data payloads happen to contain the protocol magic numbers). It
> also prohibits the ability to do efficient sparse reads, or to
> return a short read where an error is reported without also
> sending length bytes of (bogus) data.
>
> Remedy this by adding a new option request negotiation, which
> affects the response of the NBD_CMD_READ command, and sets
> forth rules for how future command responses must behave when
> they carry a data payload. It also makes it possible to return
> UTF-8 human-readable messages alongside an error code; and
> therefore structured replies are permitted for all commands.
>
> Signed-off-by: Eric Blake <eblake@...696...>
> ---
>
> In v4: rearrange paragraphs a bit, document 'structured reply chunk
> message' up front while still deferring details to the experimental
> section, lots of wording tweaks, remove options, add optional UTF-8
> error text to error chunks
>
> doc/proto.md | 370 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 352 insertions(+), 18 deletions(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index 2600098..5f2fc02 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -184,17 +184,31 @@ required to.
>
> ### Transmission
>
> -There are two message types in the transmission phase: the request,
> -and the reply. The phase consists of a series of transactions, where
> -the client submits requests and the server sends corresponding
> -replies, with a single reply message per request, and continues until
> -either side closes the connection.
> +There are three message types in the transmission phase: the request,
> +the simple reply, and the experimental structured reply chunk. The
> +transmission phase consists of a series of transactions, where the
> +client submits requests and the server sends corresponding replies
> +with either a single simple reply or a series of one or more
> +structured reply chunks per request. The phase continues until either
> +side closes the connection.
> +
> +Note that without client negotiation, the server MUST use only simple
> +replies, and that it is impossible to tell by reading the server
> +traffic in isolation whether a data field will be present; the simple
> +reply is also problematic for error handling of the `NBD_CMD_READ`
> +request. Therefore, the experimental `STRUCTURED_REPLY` extension
> +creates a context-free server stream by introducing the use of
> +structured reply chunks; see below.
>
> Replies need not be sent in the same order as requests (i.e., requests
> -may be handled by the server asynchronously). Clients SHOULD use a
> -handle that is distinct from all other currently pending transactions,
> -but MAY reuse handles that are no longer in flight; handles need not
> -be consecutive. In each reply, the server MUST use the same value for
> +may be handled by the server asynchronously), and structured reply
> +chunks from one request may be interleaved with reply messages from
> +other requests; however, there may be constraints that prevent
> +arbitrary reordering of structured reply chunks within a given reply.
> +Clients SHOULD use a handle that is distinct from all other currently
> +pending transactions, but MAY reuse handles that are no longer in
> +flight; handles need not be consecutive. In each reply message
> +(whether simple or structured), the server MUST use the same value for
> handle as was sent by the client in the corresponding request. In
> this way, the client can correlate which request is receiving a
> response.
> @@ -211,15 +225,25 @@ C: 64 bits, offset (unsigned)
> C: 32 bits, length (unsigned)
> C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)
>
> -#### Reply message
> +#### Simple reply message
>
> -The server replies with:
> +The simple reply message MUST be sent by the server in response to all
> +requests if the experimental `STRUCTURED_REPLY` extension was not
> +negotiated. If structured replies have been negotiated, a simple
> +reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> +but only if the reply has no data payload. The message looks as
> +follows:
>
> -S: 32 bits, 0x67446698, magic (`NBD_REPLY_MAGIC`)
> -S: 32 bits, error
> +S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`)
> +S: 32 bits, error (MAY be zero)
> S: 64 bits, handle
> S: (*length* bytes of data if the request is of type `NBD_CMD_READ`)
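To make the context dependency of the simple reply concrete, here is a minimal Python sketch (not part of the patch). It assumes big-endian (network byte order) fields, which the quoted hunk does not restate, and the `pending` dictionary is a hypothetical bookkeeping structure that a client would maintain itself.

```python
import struct

NBD_SIMPLE_REPLY_MAGIC = 0x67446698  # value from the patch above
NBD_CMD_READ = 0

# Assumption: all fields are big-endian (network byte order).
SIMPLE_REPLY = struct.Struct(">IIQ")  # magic, error, handle


def parse_simple_reply(header: bytes):
    """Return (error, handle) from a 16-byte simple reply header."""
    magic, error, handle = SIMPLE_REPLY.unpack(header)
    if magic != NBD_SIMPLE_REPLY_MAGIC:
        raise ValueError(f"bad magic 0x{magic:08x}")
    return error, handle


def simple_reply_payload_len(pending: dict, handle: int) -> int:
    """Number of data bytes that follow the header.

    `pending` (hypothetical) maps handle -> (command type, length) as
    recorded when the request was sent. The reply itself does not carry
    this information, so the server stream cannot be decoded alone.
    """
    cmd, length = pending[handle]
    return length if cmd == NBD_CMD_READ else 0
```

The second function is the point of the sketch: the payload length comes from client-side state rather than from anything in the reply.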
>
> +#### Structured reply chunk message
> +
> +This reply type MUST NOT be used except as documented by the
> +experimental `STRUCTURED_REPLY` extension; see below.
> +
> ## Values
>
> This section describes the value and meaning of constants (other than
> @@ -263,6 +287,8 @@ immediately after the handshake flags field in oldstyle negotiation:
> schedule I/O accesses as for a rotational medium
> - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
> `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_DF`; defined by the `STRUCTURED_REPLY` extension;
> + see below.
>
> Clients SHOULD ignore unknown flags.
>
> @@ -351,6 +377,10 @@ of the newstyle negotiation.
>
> Defined by the experimental `SELECT` extension; see below.
>
> +- `NBD_OPT_STRUCTURED_REPLY` (8)
> +
> + Defined by the experimental `STRUCTURED_REPLY` extension; see below.
> +
> #### Option reply types
>
> These values are used in the "reply type" field, sent by the server
> @@ -450,6 +480,8 @@ valid may depend on negotiation during the handshake phase.
> set to 1 if the client requires "Force Unit Access" mode of
> operation. MUST NOT be set unless transmission flags included
> `NBD_FLAG_SEND_FUA`.
> +- bit 1, `NBD_CMD_FLAG_DF`; defined by the experimental `STRUCTURED_REPLY`
> + extension; see below
>
> #### Request types
>
> @@ -458,19 +490,32 @@ The following request types exist:
> * `NBD_CMD_READ` (0)
>
> A read request. Length and offset define the data to be read. The
> - server MUST reply with a reply header, followed immediately by len
> - bytes of data, read offset bytes into the file, unless an error
> + server MUST reply with either a simple reply or a structured
> + reply, according to whether the experimental `STRUCTURED_REPLY`
> + extension was negotiated.
> +
> + If structured replies were not negotiated, the server MUST reply
> + with a simple reply header, followed immediately by len bytes of
> + data, read from offset bytes into the file, unless an error
> condition has occurred.
>
> - If an error occurs, the server SHOULD set the appropriate error code
> - in the error field. The server MUST then either close the
> - connection, or send *length* bytes of data (which MAY be invalid).
> + If an error occurs, the server SHOULD set the appropriate error
> + code in the error field. The server MUST then either close the
> + connection, or send *length* bytes of data (these bytes MAY be
> + invalid, in which case they SHOULD be zero); this is true even if
> + the error is `EINVAL` for bad flags detected before even
> + attempting to read.
>
> If an error occurs while reading after the server has already sent
> out the reply header with an error field set to zero (i.e.,
> signalling no error), the server MUST immediately close the
> connection; it MUST NOT send any further data to the client.
>
> + The experimental `STRUCTURED_REPLY` extension changes the reply
> + from a simple reply to a structured reply, in part to allow
> + recovery after a partial read and more efficient reads of sparse
> + files; see below.
> +
> * `NBD_CMD_WRITE` (1)
>
> A write request. Length and offset define the location and amount of
> @@ -556,6 +601,8 @@ The following error values are defined:
> * `ENOMEM` (12), Cannot allocate memory.
> * `EINVAL` (22), Invalid argument.
> * `ENOSPC` (28), No space left on device.
> +* `EOVERFLOW` (75), Value too large; SHOULD NOT be sent outside of the
> + experimental `STRUCTURED_REPLY` extension; see below.
>
> The server SHOULD return `ENOSPC` if it receives a write request
> including one or more sectors beyond the size of the device. It SHOULD
> @@ -668,6 +715,293 @@ option reply type.
> message if they do not also send it as a reply to the
> `NBD_OPT_SELECT` message.
>
> +### `STRUCTURED_REPLY` extension
> +
> +Some of the major downsides of the default simple reply to
> +`NBD_CMD_READ` are as follows. First, it is not possible to support
> +partial reads or early errors (the command must succeed or fail as a
> +whole, and either len bytes of data must be sent or the connection
> +must be closed, even if the failure is `EINVAL` due to bad flags).
> +Second, there is no way to efficiently skip over portions of a sparse
> +file that are known to contain all zeroes. Finally, it is not
> +possible to reliably decode the server traffic without also having
> +context of what pending read requests were sent by the client.
> +
> +To remedy this, a `STRUCTURED_REPLY` extension is envisioned. This
> +extension adds a new transmission phase message type, a new option
> +request, a new transmission flag, a new command flag, a new command
> +error, and alters the reply to the `NBD_CMD_READ` request.
> +
> +* Transmission phase
> +
> + A structured reply in the transmission phase consists of one or
> + more structured reply chunk messages. The server MUST NOT send
> + this reply type unless the client has successfully negotiated
> + structured replies via `NBD_OPT_STRUCTURED_REPLY`. Conversely, if
> + structured replies are negotiated, the server MUST use a
> + structured reply for any response with a payload, and MUST NOT use
> + a simple reply for `NBD_CMD_READ` (even for the case of an early
> + `EINVAL` due to bad flags).
> +
> + A structured reply MAY occupy multiple structured chunk messages
> + (all with the same value for "handle"), and the
> + `NBD_REPLY_FLAG_DONE` reply flag is used to identify the final
> + chunk. Relative ordering of the chunks MAY be important;
> + individual commands will document constraints on whether multiple
> + chunks may be rearranged; however, it is always safe to interleave
> + chunks of the reply to one request with messages related to other
> + requests. A server SHOULD try to minimize the number of chunks
> + sent in a reply, but MUST NOT send the final chunk if there is
> + still a possibility of detecting an error. A structured reply is
> + considered successful only if it did not contain any error chunks,
> + although the client MAY be able to determine partial success based
> + on the chunks received.
> +
> + A structured reply chunk message looks as follows:
> +
> + S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
> + S: 16 bits, flags
> + S: 16 bits, type
> + S: 64 bits, handle
> + S: 32 bits, length of payload (unsigned)
> + S: *length* bytes of payload data (if *length* is non-zero)
> +
> + The use of *length* in the reply allows context-free division of
> + the overall server traffic into individual reply messages; the
> + *type* field describes how to further interpret the payload.
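As an illustration of the context-free framing described here, the following Python sketch (not part of the patch) splits a byte stream into structured reply chunks using only the header. It assumes big-endian fields and, for brevity, a stream that contains only structured chunks with no interleaved simple replies; the function name is hypothetical.

```python
import struct

NBD_STRUCTURED_REPLY_MAGIC = 0x668E33EF  # value from the patch above
NBD_REPLY_FLAG_DONE = 1 << 0             # bit 0 of the flags field

# Assumption: big-endian fields. magic, flags, type, handle, length.
CHUNK_HEADER = struct.Struct(">IHHQI")   # 20 bytes, no padding


def split_chunks(stream: bytes):
    """Yield (flags, type, handle, payload) for each complete chunk.

    Stops quietly at a trailing partial chunk so the caller can wait
    for more bytes.
    """
    pos = 0
    while pos + CHUNK_HEADER.size <= len(stream):
        magic, flags, ctype, handle, length = CHUNK_HEADER.unpack_from(stream, pos)
        if magic != NBD_STRUCTURED_REPLY_MAGIC:
            raise ValueError(f"bad magic 0x{magic:08x} at offset {pos}")
        start = pos + CHUNK_HEADER.size
        if start + length > len(stream):
            break  # payload not fully received yet
        yield flags, ctype, handle, stream[start:start + length]
        pos = start + length
```

No knowledge of the client's requests is needed: the *length* field alone says where the next message begins.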
> +
> + * Structured reply flags
> +
> + This field of 16 bits is sent by the server as part of every
> + structured reply.
> +
> + - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
> + more structured reply chunks will be sent for the same client
> + request, and MUST set this bit if this is the final reply. This
> + bit MUST always be set for the `NBD_REPLY_TYPE_NONE` chunk,
> + although any other chunk type can also be used as the final
> + chunk.
> +
> + The server MUST NOT set any other flags without first negotiating
> + the extension with the client, unless the client can usefully
> + react to the response without interpreting the flag (for instance
> + if the flag is some form of hint). Clients MUST ignore
> + unrecognized flags.
> +
> + * Structured Reply types
> +
> + These values are used in the "type" field of a structured reply.
> + Some chunk types can additionally be categorized by role, such as
> + *error chunks* or *content chunks*. Each type determines how to
> + interpret the "length" bytes of payload. If the client receives
> + an unknown or unexpected type, it SHOULD close the connection.
> +
> + - `NBD_REPLY_TYPE_NONE` (0)
> +
> + *length* MUST be 0 (and the payload field omitted). This chunk
> + type MUST always be used with the `NBD_REPLY_FLAG_DONE` bit set
> + (that is, it may appear at most once in a structured reply, and
> + is only useful as the final reply chunk). If no earlier error
> + chunks were sent, then this type implies that the overall client
> + request is successful. Valid as a reply to any request.
> +
> + - `NBD_REPLY_TYPE_ERROR` (1)
> +
> + This chunk type is in the error chunk category. *length* MUST
> + be at least 4. This chunk represents that an error occurred,
> + and the client MAY NOT make any assumptions about partial
> + success. This type SHOULD NOT be used more than once in a
> + structured reply. Valid as a reply to any request.
> +
> + The payload is structured as:
> +
> + 32 bits: error (MUST be nonzero)
> + *length - 4* bytes: (optional UTF-8 encoded data suitable for
> + direct display to a human being, not NUL terminated)
> +
> + - `NBD_REPLY_TYPE_ERROR_OFFSET` (2)
> +
> + This chunk type is in the error chunk category. *length* MUST
> + be at least 12. This reply represents that an error occurred at
> + a given offset, which MUST lie within the original offset and
> + length of the request; the client can use this offset to
> + determine if the request had any partial success. This chunk type
> + MAY appear multiple times in a structured reply, although the
> + same offset SHOULD NOT be repeated. Likewise, if content chunks
> + were sent earlier in the structured reply, the server SHOULD NOT
> + send multiple distinct offsets that lie within the bounds of a
> + single content chunk. Valid as a reply to `NBD_CMD_READ`,
> + `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
> +
> + The payload is structured as:
> +
> + 32 bits: error (MUST be nonzero)
> + 64 bits: offset (unsigned)
> + *length - 12* bytes: (optional UTF-8 encoded data suitable for
> + direct display to a human being, not NUL terminated)
> +
> + - `NBD_REPLY_TYPE_OFFSET_DATA` (3)
> +
> + This chunk type is in the content chunk category. *length* MUST
> + be at least 9. It represents the contents of *length - 8* bytes
> + of the file, starting at *offset*. The data MUST lie within the
> + bounds of the original offset and length of the client's
> + request, and MUST NOT overlap with the bounds of any earlier
> + content chunk or error chunk in the same reply. This chunk may
> + be used more than once in a reply, unless the `NBD_CMD_FLAG_DF`
> + flag was set. Valid as a reply to `NBD_CMD_READ`.
> +
> + The payload is structured as:
> +
> + 64 bits: offset (unsigned)
> + *length - 8* bytes: data
> +
> + - `NBD_REPLY_TYPE_OFFSET_HOLE` (4)
> +
> + This chunk type is in the content chunk category. *length* MUST
> + be exactly 12. It represents that the contents of *hole size*
> + bytes starting at *offset* read as all zeroes. The hole MUST
> + lie within the bounds of the original offset and length of the
> + client's request, and MUST NOT overlap with the bounds of any
> + earlier content chunk or error chunk in the same reply. This
> + chunk may be used more than once in a reply, unless the
> + `NBD_CMD_FLAG_DF` flag was set. Valid as a reply to
> + `NBD_CMD_READ`.
> +
> + The payload is structured as:
> +
> + 64 bits: offset (unsigned)
> + 32 bits: hole size (unsigned, MUST be nonzero)
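The payload layouts of the five chunk types can be sketched as a small Python decoder (not part of the patch). Big-endian fields are assumed, the dictionary shape returned is purely illustrative, and raising `ValueError` stands in for "the client SHOULD close the connection".

```python
import struct

NBD_REPLY_TYPE_NONE = 0
NBD_REPLY_TYPE_ERROR = 1
NBD_REPLY_TYPE_ERROR_OFFSET = 2
NBD_REPLY_TYPE_OFFSET_DATA = 3
NBD_REPLY_TYPE_OFFSET_HOLE = 4


def decode_payload(ctype: int, payload: bytes) -> dict:
    """Decode one chunk payload, enforcing the length rules from the patch."""
    n = len(payload)
    if ctype == NBD_REPLY_TYPE_NONE:
        if n != 0:
            raise ValueError("NONE chunk must have an empty payload")
        return {"kind": "none"}
    if ctype == NBD_REPLY_TYPE_ERROR:
        if n < 4:
            raise ValueError("ERROR chunk shorter than 4 bytes")
        (error,) = struct.unpack_from(">I", payload)
        if error == 0:
            raise ValueError("error field must be nonzero")
        # Optional human-readable text, not NUL terminated.
        return {"kind": "error", "error": error,
                "message": payload[4:].decode("utf-8", errors="replace")}
    if ctype == NBD_REPLY_TYPE_ERROR_OFFSET:
        if n < 12:
            raise ValueError("ERROR_OFFSET chunk shorter than 12 bytes")
        error, offset = struct.unpack_from(">IQ", payload)
        if error == 0:
            raise ValueError("error field must be nonzero")
        return {"kind": "error_offset", "error": error, "offset": offset,
                "message": payload[12:].decode("utf-8", errors="replace")}
    if ctype == NBD_REPLY_TYPE_OFFSET_DATA:
        if n < 9:
            raise ValueError("OFFSET_DATA chunk needs at least one data byte")
        (offset,) = struct.unpack_from(">Q", payload)
        return {"kind": "data", "offset": offset, "data": payload[8:]}
    if ctype == NBD_REPLY_TYPE_OFFSET_HOLE:
        if n != 12:
            raise ValueError("OFFSET_HOLE chunk must be exactly 12 bytes")
        offset, size = struct.unpack(">QI", payload)
        if size == 0:
            raise ValueError("hole size must be nonzero")
        return {"kind": "hole", "offset": offset, "length": size}
    raise ValueError(f"unknown chunk type {ctype}")
```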
> +
> +* `NBD_OPT_STRUCTURED_REPLY`
> +
> + The client wishes to use structured replies during the
> + transmission phase. The option request has no additional data.
> +
> + The server replies with the following:
> +
> + - `NBD_REP_ACK`: Structured replies have been negotiated; the
> + server MUST use structured replies to the `NBD_CMD_READ`
> + transmission request. Other extensions that require structured
> + replies may now be negotiated.
> + - For backwards compatibility, clients should be prepared to also
> + handle `NBD_REP_ERR_UNSUP`; in this case, no structured replies
> + will be sent.
> +
> + It is envisioned that future extensions will add other new
> + requests that may require a data payload in the reply. A server
> + that supports such extensions SHOULD NOT advertise those
> + extensions until the client negotiates structured replies; and a
> + client MUST NOT make use of those extensions without first
> + enabling the `NBD_OPT_STRUCTURED_REPLY` extension.
> +
> +* `NBD_FLAG_SEND_DF`
> +
> + The server MUST set this transmission flag to 1 if the
> + `NBD_CMD_READ` request supports the `NBD_CMD_FLAG_DF` flag, and
> + MUST leave this flag clear if structured replies have not been
> + negotiated. Clients MUST NOT rely on the state of this flag prior to
> + the final flags value reported by `NBD_OPT_EXPORT_NAME` or
> + experimental `NBD_OPT_GO`. Additionally, clients MUST NOT set the
> + `NBD_CMD_FLAG_DF` request flag unless this transmission flag is
> + set.
> +
> +* `NBD_CMD_FLAG_DF`
> +
> + The "don't fragment" flag, valid during `NBD_CMD_READ`. SHOULD be
> + set to 1 if the client requires the server to send at most one
> + content chunk in reply. MUST NOT be set unless the transmission
> + flags include `NBD_FLAG_SEND_DF`. Use of this flag MAY trigger an
> + `EOVERFLOW` error chunk, if the request length is too large.
> +
> +* `EOVERFLOW`
> +
> + The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
> + client has requested `NBD_CMD_FLAG_DF` for a length that is too
> + large to read without fragmentation. The server MUST NOT return
> + this error if the read request did not exceed 65,536 bytes, and
> + SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.
> +
> +* `NBD_CMD_READ`
> +
> + If structured replies were not negotiated, then a read request
> + MUST always be answered by a simple reply, as documented above
> + (using magic 0x67446698 `NBD_SIMPLE_REPLY_MAGIC`, and containing
> + length bytes of data according to the client's request, although
> + those bytes MAY be invalid if an error is returned, and the
> + connection MUST be closed if an error occurs after a header
> + claiming no error).
> +
> + If structured replies are negotiated, then a read request MUST
> + result in a structured reply with one or more chunks (each using
> + magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), where the final
> + chunk has the flag `NBD_REPLY_FLAG_DONE`, and with the following
> + additional constraints.
> +
> + The server MAY split the reply into any number of content chunks;
> + each chunk MUST describe at least one byte, although to minimize
> + overhead, the server SHOULD use chunks with lengths and offsets as
> + an integer multiple of 512 bytes, where possible (the first and
> + last chunk of an unaligned read being the most obvious places for
> + an exception). The server MUST NOT send content chunks that overlap
> + with any earlier content or error chunk, and MUST NOT send chunks
> + that describe data outside the offset and length of the request,
> + but MAY send the chunks in any order (the client MUST reassemble
> + content chunks into the correct order), and MAY send additional
> + data chunks even after reporting an error chunk. Note that a
> + request for more than 2^32 - 8 bytes MUST be split into at least
> + two chunks, so as not to overflow the length field of a reply
> + while still allowing space for the offset of each chunk. When no
> + error is detected, the server MUST send enough data chunks to
> + cover the entire region described by the offset and length of the
> + client's request.
> +
> + To minimize traffic, the server MAY use a content or error chunk
> + as the final chunk by setting the `NBD_REPLY_FLAG_DONE` flag, but
> + MUST NOT do so for a content chunk if it would still be possible
> + to detect an error while transmitting the chunk. The
> + `NBD_REPLY_TYPE_NONE` chunk is always acceptable as the final
> + chunk.
> +
> + If an error is detected, the server MUST still complete the
> + transmission of any current chunk (it SHOULD use padding bytes of
> + zero for any remaining data portion of a chunk with type
> + `NBD_REPLY_TYPE_OFFSET_DATA`), but MAY omit further content
> + chunks. The server MUST include an error chunk as one of the
> + subsequent chunks, but MAY defer the error reporting behind other
> + queued chunks. An error chunk of type `NBD_REPLY_TYPE_ERROR`
> + implies that the client MAY NOT make any assumptions about
> + validity of data chunks, and if used, SHOULD be the only error
> + chunk in the reply. On the other hand, an error chunk of type
> + `NBD_REPLY_TYPE_ERROR_OFFSET` gives fine-grained information about
> + which earlier data chunk(s) encountered a failure, and MAY also be
> + sent in lieu of a data chunk; as such, a server MAY still usefully
> + follow it with further non-overlapping content chunks or with
> + error offsets for other content chunks. Generally, a server
> + SHOULD NOT mix errors with offsets with a generic error. As long
> + as all errors are accompanied by offsets, the client MAY assume
> + that any data chunks with no subsequent error offset are valid,
> + that chunks with an overlapping error offset are valid up
> + until the reported offset, and that portions of the read that do
> + not have a corresponding content chunk are not valid.
> +
> + A client MAY close the connection if it detects that the server
> + has sent invalid chunks (such as overlapping data, or not enough
> + data before claiming success).
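The client-side reassembly and validity checks could look roughly like the Python sketch below (not part of the patch). It handles only the success case with no error chunks, and the `(kind, offset, body)` tuple format is a hypothetical representation of already-decoded content chunks, where `body` is the data bytes for a data chunk or the hole size for a hole.

```python
def reassemble(req_offset: int, req_length: int, chunks) -> bytes:
    """Rebuild the read buffer from content chunks arriving in any order.

    Raises ValueError for out-of-bounds, overlapping, or missing data,
    the cases in which a client MAY close the connection.
    """
    buf = bytearray(req_length)  # zero-filled, so holes need no copy
    seen = []                    # (start, end) ranges already covered
    for kind, offset, body in chunks:
        size = len(body) if kind == "data" else body
        start, end = offset, offset + size
        if size == 0 or start < req_offset or end > req_offset + req_length:
            raise ValueError("chunk outside the bounds of the request")
        if any(start < e and s < end for s, e in seen):
            raise ValueError("overlapping content chunks")
        seen.append((start, end))
        if kind == "data":
            buf[start - req_offset:end - req_offset] = body
    # Ranges are disjoint and in bounds, so full coverage means the
    # covered sizes add up to the requested length.
    if sum(e - s for s, e in seen) != req_length:
        raise ValueError("not enough data before claiming success")
    return bytes(buf)
```

For example, a hole chunk covering the first half of a request and a data chunk covering the second half may arrive in either order and produce the same buffer.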
> +
> + In order to avoid the burden of reassembly, the client MAY set the
> + `NBD_CMD_FLAG_DF` flag ("don't fragment"). If this flag is set,
> + the server MUST send at most one content chunk, although it MAY
> + still send multiple chunks (the remaining chunks would be error
> + chunks or a final type of `NBD_REPLY_TYPE_NONE`). If the area
> + being read contains both data and a hole, the server MUST use
> + `NBD_REPLY_TYPE_OFFSET_DATA` with the zeroes explicitly present.
> + A server MAY reject a client's request with the error `EOVERFLOW`
> + if the length is too large to send without fragmentation, in which
> + case it MUST NOT send a content chunk; however, the server MUST
> + NOT use this error if the client's requested length does not
> + exceed 65,536 bytes.
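The `EOVERFLOW` rule for "don't fragment" reads can be sketched as a small server-side check in Python (not part of the patch). The `server_max_chunk` parameter is a hypothetical implementation limit, and the flag value follows the "bit 1" assignment given in this version of the patch.

```python
EOVERFLOW = 75
NBD_CMD_FLAG_DF = 1 << 1      # bit 1, per this version of the patch
DF_GUARANTEED_LENGTH = 65536  # reads up to this size MUST NOT get EOVERFLOW


def df_error(cmd_flags: int, length: int, server_max_chunk: int):
    """Return EOVERFLOW if the server may reject the read, else None.

    `server_max_chunk` (hypothetical) is the largest single content
    chunk this server is willing to send.
    """
    if not cmd_flags & NBD_CMD_FLAG_DF:
        return None  # fragmentation is allowed, so no overflow error
    limit = max(server_max_chunk, DF_GUARANTEED_LENGTH)
    return EOVERFLOW if length > limit else None
```

Clamping the limit to at least 65,536 keeps a server with a small internal buffer from violating the MUST NOT in the text.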
> +
> ## About this file
>
> This file tries to document the NBD protocol as it is currently
> --
> 2.5.5
>
>
--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12