Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension

To: Eric Blake <eblake@...696...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>, Wouter Verhelst <w@...112...>
Subject: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
From: Alex Bligh <alex@...872...>
Date: Thu, 31 Mar 2016 12:23:11 +0100
Message-id: <A6C62F41-19EF-4C8A-AC3F-031903A0F9F3@...872...>
In-reply-to: <1459404384-5258-5-git-send-email-eblake@...696...>
References: <1459404384-5258-1-git-send-email-eblake@...696...> <1459404384-5258-5-git-send-email-eblake@...696...>
Eric,

We're getting there! I'm only going to comment on this one rather than
the 'final form' patch, so we get this one right to start off with.

On 31 Mar 2016, at 07:06, Eric Blake <eblake@...696...> wrote:
> ### Transmission
> 
> There are two message types in the transmission phase: the request,
> -and the reply.  The phase consists of a series of transactions, where
> -the client submits requests and the server sends corresponding
> -replies, with a single reply message per request, and continues until
> -either side closes the connection.
> +and the simple reply.


But if this describes transmissions in general, shouldn't this be
"the request, and the reply; the reply may be either a simple
reply or a structured reply (for which is permissible under
which circumstances, see below)"

The existing wording suggests structured replies are not part of
the transmission phase.

> +The phase consists of a series of transactions,
> +where the client submits requests and the server sends corresponding
> +replies, with a single simple reply message per request, and continues

Again "with a single reply per request (either simple or structured)"

> +until either side closes the connection.
> 
> Replies need not be sent in the same order as requests (i.e., requests
> may be handled by the server asynchronously).  Clients SHOULD use a
> handle that is distinct from all other currently pending transactions,
> but MAY reuse handles that are no longer in flight; handles need not
> -be consecutive.  In each reply, the server MUST use the same value for
> -handle as was sent by the client in the corresponding request.  In
> -this way, the client can correlate which request is receiving a
> -response.
> +be consecutive.  In each reply message, the server MUST use the same
> +value for handle as was sent by the client in the corresponding
> +request.  In this way, the client can correlate which request is
> +receiving a response.
> +
> +Note that it is impossible to tell by reading just the server traffic
> +whether a data field of a simple reply will be present; the simple
> +reply is also problematic for error handling of the `NBD_CMD_READ`
> +request.  Therefore, the experimental `STRUCTURED_REPLY` extension
> +creates a context-free server stream by adding an additional
> +structured reply type, and documents that it is possible to have
> +multiple structured reply messages (called chunks) in response to a
> +single request message; see below.
> 
> #### Request message
> 
> @@ -209,12 +218,30 @@ C: 64 bits, offset (unsigned)
> C: 32 bits, length (unsigned)  
> C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> 
> -#### Reply message
> +#### Simple reply message
> 
> -The server replies with:
> +[option #A1 - only replies with payload are affected]
> +The simple reply message MUST be sent by the server in response to a
> +request that requires no data payload.

Well that's *definitely* wrong (even if I wrote it!) as it implies
that if structured replies have been negotiated, you must still
send a simple reply if the result is an error. That can't be right!
I think you (I?) mean "in response to a request that never returns
a data payload".

>  It MUST also be used for the
> +`NBD_CMD_READ` command if the experimental `STRUCTURED_REPLY`
> +extension was not negotiated.  The message looks as follows:
> 
> -S: 32 bits, 0x67446698, magic (`NBD_REPLY_MAGIC`)  
> -S: 32 bits, error  
> +[option #A2 - enabling structured replies MAY affect all other commands]
> +The simple reply message MUST be sent by the server in response to all
> +requests if the experimental `STRUCTURED_REPLY` extension was not
> +negotiated.  It MAY also be used for requests that require no data
> +payload, even when structured replies are in use.

s/in use/have been negotiated/

>  The message looks
> +as follows:
> +
> +[option #A3 - enabling structured replies MUST affect all commands]
> +The simple reply message MUST be sent by the server in response to all
> +requests if the experimental `STRUCTURED_REPLY` extension was not
> +negotiated,

"not negotiated" should read "negotiated"!

> and MUST NOT be sent otherwise.  The message looks as
> +follows:

#A2 and #A3 are my preference.

I'm not sure whether Wouter's veto of A3 was due to the complete
backwards logic!

I can't help thinking that if the structured reply thing is an
experimental extension, the wording on when it should be used would
be better down with where it describes the extension.

> * `NBD_CMD_READ` (0)
> 
>     A read request. Length and offset define the data to be read. The
> -    server MUST reply with a reply header, followed immediately by len
> -    bytes of data, read offset bytes into the file, unless an error
> -    condition has occurred.
> +    server MUST reply with a simple reply header, followed immediately
> +    by len bytes of data, read from offset bytes into the file, unless
> +    an error condition has occurred.

I think: "The server MUST reply with either a simple reply or a
structured reply; which is permissible is defined elsewhere in
this document. In the case of a simple reply, the reply consists
of a simple reply header ..."

(structured replies are defined elsewhere)

> 
>     If an error occurs, the server SHOULD set the appropriate error code
>     in the error field. The server MUST then either close the
> @@ -469,13 +504,18 @@ The following request types exist:
>     signalling no error), the server MUST immediately close the
>     connection; it MUST NOT send any further data to the client.
> 
> +    The experimental `STRUCTURED_REPLY` extension changes from a
> +    simple reply to a structured reply,

Changes what?

I think better "changes replies under some circumstances from a simple
reply to a structured reply"

> in part to allow recovery
> +    after a partial read and more efficient reads of sparse files; see
> +    below.
> +
> * `NBD_CMD_WRITE` (1)
> 
>     A write request. Length and offset define the location and amount of
>     data to be written. The client MUST follow the request header with
>     *length* number of bytes to be written to the device.
> 
> -    The server MUST write the data to disk, and then send the reply
> +    The server MUST write the data to disk, and then send the simple reply
>     message.

This depends whether we choose #A1, #A2, or #A3. I think it's better
to just refer to "reply" here.

> The server MAY send the reply message before the data has
>     reached permanent storage.
> 
> @@ -500,7 +540,7 @@ The following request types exist:
> * `NBD_CMD_FLUSH` (3)
> 
>     A flush request; a write barrier. The server MUST NOT send a
> -    successful reply header for this request before all write requests
> +    successful simple reply header for this request before all write requests

This depends whether we choose #A1, #A2, or #A3. I think it's better
to just refer to "reply" here.

>     for which a reply has already been sent to the client have reached
>     permanent storage (using fsync() or similar).
> 
> @@ -554,6 +594,8 @@ The following error values are defined:
> * `ENOMEM` (12), Cannot allocate memory.
> * `EINVAL` (22), Invalid argument.
> * `ENOSPC` (28), No space left on device.
> +* `EOVERFLOW` (75), Value too large; MUST NOT be sent outside of the
> +  experimental `STRUCTURED_REPLY` extension; see below.
> 
> The server SHOULD return `ENOSPC` if it receives a write request
> including one or more sectors beyond the size of the device.  It SHOULD
> @@ -654,6 +696,321 @@ option reply type.
>       message if they do not also send it as a reply to the
>       `NBD_OPT_SELECT` message.
> 
> +### `STRUCTURED_REPLY` extension
> +
> +Some of the major downsides of the default simple reply to
> +`NBD_CMD_READ` are as follows.  First, it is not possible to support
> +partial reads (the command must succeed or fail as a whole, either len
> +bytes of data must be sent or the connection must be closed).  There

s/There/Second, there/

> +is no way to efficiently skip over portions of a sparse file that are
> +known to contain all zeroes.  Finally, it is not possible to reliably
> +decode the server traffic without also having context of what pending
> +read requests were sent by the client.
> +
> +To remedy this, a `STRUCTURED_REPLY` extension is envisioned. This
> +extension adds a new option request, a new transmission flag, a new
> +reply type during the transmission phase, a new command flag, a new
> +command error, and alters the reply to the `NBD_CMD_READ` request.
> +
> +* `NBD_OPT_STRUCTURED_REPLY`
> +
> +    The client wishes to use structured replies during the
> +    transmission phase.  The option request has no additional data.
> +
> +    The server replies with the following:
> +
> +    - `NBD_REP_ACK`: Structured replies have been negotiated; the server
> +      MUST set the `NBD_FLAG_SEND_DF` flag in all future transmission
> +      flags, and MUST use structured replies to the `NBD_CMD_READ`
> +      transmission request.  Further extensions that use structured
> +      replies may now be negotiated.
> +    - For backwards compatibility, clients should be prepared to also
> +      handle `NBD_REP_ERR_UNSUP`; in this case, no structured replies
> +      will be sent.
> +
> +    It is envisioned that future extensions will add other new
> +    requests that also require a data payload in the reply.  Such
> +    extensions MUST use a structured reply, and not a simple reply.  A
> +    server that supports such extensions MUST NOT advertise those
> +    extensions until the client negotiates structured replies; and a
> +    client MUST NOT make use of those extensions without first
> +    enabling the `NBD_OPT_STRUCTURED_REPLY` extension.
> +
> +* `NBD_FLAG_SEND_DF`
> +
> +    [option #B1 - transmission flags always mirror current state;
> +    state change can be observed if negotiation happens after
> +    NBD_OPT_LIST]
> +    The server MUST set this transmission flag to 1 if structured
> +    replies have been negotiated, and MUST NOT set this flag
> +    otherwise; that way, the client MAY reliably use this flag as a
> +    reliable witness of whether to expect a simple reply or structured
> +    reply to the `NBD_CMD_READ` transmission request.
> +
> +    [option #B2 - final transmission flags are accurate, but
> +    intermediate transmission flags can anticipate negotiation; state
> +    change can be observed if negotiation does not happen]
> +    When responding to the `NBD_OPT_EXPORT_NAME` option request (or
> +    the `NBD_OPT_SELECT` request of the experimental `SELECT`
> +    extension), the server MUST set this transmission flag to 1 if
> +    structured replies have been negotiated, and MUST NOT set this
> +    flag otherwise; that way, the client MAY reliably use the final
> +    state of this flag as a reliable witness of whether to expect a
> +    simple reply or structured reply to the `NBD_CMD_READ`
> +    transmission request.  When responding to the `NBD_OPT_LIST`
> +    option request, the server MAY set this transmission flag, even if
> +    structured replies have not yet been negotiated.

This is an export flag, yes? Save in oldstyle negotiation (which surely
no one cares about as structured replies can't have been negotiated) the
export flags are sent after option haggling, in which case we always
know whether to set this - we set it to 1 if structured replies have
been negotiated, or to 0 otherwise. That's all this needs to say.

Therefore the difference in text merely comes down to its additional
use in the SELECT extension. That can be covered by a simple section
in the SELECT extension saying "The value of NBD_FLAG_SEND_DF in
the select extension cannot be relied upon by the Client and MUST
be set to zero by the server. This is because at the stage the
SELECT extension is negotiated, structured replies may or may not
have been negotiated".

Could someone recap why we need this flag? We know when DF can be
sent, it's exactly when structured replies have been negotiated.
I think Wouter suggested there should be a 1:1 relationship between
SEND_ flags and CMD_ flags for convenience of user space. But that
seems odd as there isn't at the moment. For instance SEND_FUA is
on bit 3, CMD_FUA is on bit 0 or (cough) 16. None of the others
have CMD flag equivalents as far as I can see.

My vote would be delete the flag (again)!

> +    Unless explicitly documented for a given request, a structured
> +    reply MUST occupy only one message (similar to a simple reply).

I think that's unnecessary. If you are going to allow structured
replies, I see no reason why you can't reply with an error chunk,
then a final 'none' chunk with the 'end' bit set. Apologies if I
made this point before and it was rejected.

> +    However, some requests document that a structured reply MAY occupy
> +    multiple chunks; each chunk uses a structured reply message (all
> +    with the same value for "handle"), and the `NBD_REPLY_FLAG_DONE`
> +    reply flag is used to identify the final chunk.  Where multiple
> +    chunks are permitted, the intermediate chunks MAY be reordered
> +    within constraints documented by the request, and the chunks MAY
> +    be interleaved with messages from other pending transactions; but
> +    the final chunk MUST always end the reply.

The above change would simplify this.

> +    - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
> +      more structured reply chunks will be sent for the same client
> +      request, and MUST set this bit if this is the final reply.  This
> +      flag must always be set in response to requests which are
> +      documented as using a structured reply, but not documented as
> +      permitting multiple chunks.

Ditto

> +    The server MUST NOT set any other flags without first negotiating
> +    the extension with the client.  Clients that receive an
> +    unrecognized flag SHOULD close the connection.

An alternative formulation would be that:

The server MUST NOT set any other flags without negotiating the
extension with the client, unless the client can perform the
action requested without interpreting the flag (for instance if
the flag is some form of hint). Clients MUST ignore unrecognised
flags.

This leaves it open for the server to negotiate additional flags
but means simple flags that are hints do not require negotiation
changes.

> +    - `NBD_REPLY_TYPE_ERROR` (1)
> +
> +      This reply type represents an error chunk.  *length* MUST be
> +      exactly 4.  The payload is structured as:
> +
> +      32 bits: error (MUST be nonzero)  
> +
> +      This reply represents that an error occurred, and the client MAY
> +      NOT make any assumptions about partial success.

"This reply represents the occurrence of an error. The client MUST NOT
make any assumptions as to partial success and MUST treat all data
received (if any) as potentially invalid."

> This type SHOULD
> +      NOT be used unless it is the final reply chunk (where the flag
> +      `NBD_REPLY_FLAG_DONE` is set), or if it is immediately followed
> +      by a chunk with type `NBD_REPLY_TYPE_NONE`.

I know this is only a 'SHOULD' but I don't see any particular reason for
this distinction. Going back to the Ceph example, it sets of 10
reads in parallel, and gets an error. Why not simply send an error
packet, then when everything is done, send a 'NONE' with 'DONE' set?
Sounds reasonable to me.

You don't specifically prohibit more than one NBD_REPLY_TYPE_ERROR which
is probably a good thing, but if you intended to, you should document
it.

> +
> +    - `NBD_REPLY_TYPE_ERROR_OFFSET` (2)
> +
> +      This reply type represents an error chunk.  *length* MUST be
> +      exactly 12.  The payload is structured as:
> +
> +      32 bits: error (MUST be nonzero)  
> +      64 bits: offset (unsigned)  
> +
> +      In addition to declaring that an error occurred, this type
> +      provides enough additional information to inform the client
> +      about any partial success.  *offset* MUST lie within the bounds
> +      of the original offset and length of the client's request.  If
> +      *offset* also lies within the bounds of an earlier data chunk of
> +      the same reply, then the client MAY assume that data within that
> +      earlier chunk is valid (while the rest of that chunk MAY be
> +      bogus).  Any later data chunks of the same reply MUST NOT
> +      contain the offset of this chunk.
> +
> +      Valid as a reply to `NBD_CMD_READ`.

If we are allowing structured replies to other commands, I think this
would be a useful reply to NBD_CMD_WRITE (and for that matter TRIM).
This is actually quite a good justification for allowing structured
replies to other commands.

We should probably explicitly say that more than one of these is
permitted.

> +
> +    - `NBD_REPLY_TYPE_OFFSET_DATA` (3)
> +
> +      This reply type represents a data chunk.  *length* MUST be at
> +      least 9.  The payload is structured as:
> +
> +      64 bits: offset (unsigned)  
> +      *length - 8* bytes: data  
> +
> +      This reply represents the contents of *length - 8* bytes of the
> +      file, starting at *offset*.  The data MUST lie within the bounds
> +      of the original offset and length of the client's request, and
> +      MUST NOT overlap with any earlier data or error chunks of the
> +      same reply.

Nope. It can overlap with error chunks (see the sendfile example). It
can't overlap with holes.

"and MUST NOT overlap with any earlier data or hole chunks of the
same reply"

> +
> +      Valid as a reply to `NBD_CMD_READ`.
> +
> +    - `NBD_REPLY_TYPE_OFFSET_HOLE` (4)
> +
> +      This reply type represents a data chunk.  *length* MUST be
> +      exactly 12.  The payload is structured as:
> +
> +      64 bits: offset (unsigned)  
> +      32 bits: hole size (unsigned)  
> +
> +      This reply represents that *hole size* bytes of the file (which
> +      MUST be non-zero), starting at *offset*, read as all zeroes.
> +      The hole MUST lie within the bounds of the original offset and
> +      length of the client's request, and MUST NOT overlap with any
> +      earlier data or error chunks of the same reply.

Again, it can overlap with error chunks (see the sendfile example). It
can't overlap with holes.

"and MUST NOT overlap with any earlier data or hole chunks of the
same reply"

> +      Valid as a reply to `NBD_CMD_READ`.
> +
> +* `NBD_CMD_FLAG_DF`
> +
> +    The "don't fragment" bit, valid during `NBD_CMD_READ`.  SHOULD be
> +    set to 1 if the client requires the server to send at most one
> +    data chunk in reply.  MUST NOT be set unless the transmission
> +    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
> +    `EOVERFLOW` error chunk, if the request length is too large.
> +
> +* `EOVERFLOW`
> +
> +    The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
> +    client has requested `NBD_CMD_FLAG_DF` for a length that is too
> +    large to read without fragmentation.  The server MUST NOT return
> +    this error if the read request did not exceed 65,536 bytes, and
> +    SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.

Although I'm implementing a server at the moment where I'd like to
do just that with lengths in excess of 2^31.

> +
> +* `NBD_CMD_READ`
> +
> +    If structured replies were not negotiated, then a read request
> +    MUST always be answered by a simple reply, as documented above
> +    (using magic 0x67446698 `NBD_SIMPLE_REPLY_MAGIC`, and containing
> +    length bytes of data according to the client's request, although
> +    those bytes MAY be invalid if an error is returned, and the
> +    connection MUST be closed if an error occurs after a header
> +    claiming no error).
> +
> +    If structured replies are negotiated, then a read request MUST
> +    result in a structured reply that MAY contain one or more chunks

MUST contain one or more chunks. Ot just "that contains one or more
chunks". It always has at least one chunk even if that's a NONE chunk - surely?
Or in your terminology is the NONE thing not a chunk? In my terminology
it's a chunk, but not a data chunk. This is also consistent with your
final para:
> 
> +    In order to avoid the burden of reassembly, the client MAY set the
> +    `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not
> +    fragment the reply.  If this flag is set, the server MUST send at
> +    most one data chunk, although it MAY still send multiple chunks
> +    (the remaining chunks would be error chunks or a final type of
> +    `NBD_REPLY_TYPE_NONE`).  A server MAY reject a client's request
> +    with the error `EOVERFLOW` if the length is too large to send
> +    without fragmentation, in which case it MUST NOT send a data
> +    chunk; however, the server MUST NOT use this if error the client's
> +    requested length does not exceed 65,536 bytes.
> +
> ## About this file

-- 
Alex Bligh
Reply to:
Follow-Ups:
- Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
  - From: Eric Blake <eblake@...696...>
- Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
  - From: Eric Blake <eblake@...696...>
References:
- [Nbd] [PATCH v3 0/5] Structured replies
  - From: Eric Blake <eblake@...696...>
- [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
  - From: Eric Blake <eblake@...696...>
Prev by Date: [Nbd] NBD_CMD_FLAG_FUA
Next by Date: [Nbd] [ANN] gonbdserver / structured replies
Previous by thread: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
Next by thread: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
Index(es):
- Date
- Thread