Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension

To: Alex Bligh <alex@...872...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>, Wouter Verhelst <w@...112...>
Subject: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
From: Eric Blake <eblake@...696...>
Date: Thu, 31 Mar 2016 08:51:18 -0600
Message-id: <56FD3966.5010307@...696...>
In-reply-to: <A6C62F41-19EF-4C8A-AC3F-031903A0F9F3@...872...>
References: <1459404384-5258-1-git-send-email-eblake@...696...> <1459404384-5258-5-git-send-email-eblake@...696...> <A6C62F41-19EF-4C8A-AC3F-031903A0F9F3@...872...>
On 03/31/2016 05:23 AM, Alex Bligh wrote:
> Eric,
> 
> We're getting there! I'm only going to comment on this one rather than
> the 'final form' patch, so we get this one right to start off with.

Agreed.

> 
> On 31 Mar 2016, at 07:06, Eric Blake <eblake@...696...> wrote:
>> ### Transmission
>>
>> There are two message types in the transmission phase: the request,
>> -and the reply.  The phase consists of a series of transactions, where
>> -the client submits requests and the server sends corresponding
>> -replies, with a single reply message per request, and continues until
>> -either side closes the connection.
>> +and the simple reply.
> 
> 
> But if this describes transmissions in general, shouldn't this be
> "the request, and the reply; the reply may be either a simple
> reply or a structured reply (for which is permissible under
> which circumstances, see below)"

Maybe I just bite the bullet and call out three message types, but then
add the caveat that the third type is experimental. I'll play with the
idea for how to best call out the forward reference, while still making
it obvious that normative implementations do not have to worry about it yet.


>> -The server replies with:
>> +[option #A1 - only replies with payload are affected]
>> +The simple reply message MUST be sent by the server in response to a
>> +request that requires no data payload.
> 
> Well that's *definitely* wrong (even if I wrote it!) as it implies
> that if structured replies have been negotiated, you must still
> send a simple reply if the result is an error. That can't be right!
> I think you (I?) mean "in response to a request that never returns
> a data payload".

Indeed, my wording isn't quite right.  I'm starting to favor option #2
rather than #1, so maybe we don't need to try any harder on the wording
here, but I am trying to convey that "if a request never requires data
in its reply (which covers all existing requests except CMD_READ at the
current time), then for back-compat reasons those requests must continue
to use simple replies for both success and error even when structured
replies are negotiated".

But having written that, I still see two variations within both option 1
and option 2:

- if a request has any mode that requires data in its reply, then ALL
replies to that request must be structured

- a reply that includes data MUST use a structured reply, but a request
(other than NBD_CMD_READ) can still have a simple reply in cases where
no data is required (such as when reporting an error like EINVAL on
improper flags

That is, Option #A1, #A2, and #A3 were concerned about what
NBD_CMD_WRITE can reply, while my two variations are concerned about
what NBD_CMD_GET_LBA_STATUS can reply (I still think, for back-compat
reasons, the NBD_CMD_READ must be special-cased, and be all-or-none with
structured replies, even when reporting an error like EINVAL for an
unrecognized flag).

If we go with option #A1 (CMD_WRITE never uses structured), then the
parallel argument is that commands with possibility of data must never
use simple.  If we go with option #A2 (CMD_WRITE may use structured),
then (other than CMD_READ), it makes a bit more sense to allow a simple
reply if the server is sending an error without data (then again, the
whole notion of the structured error with UTF-8 human-readable string
makes it less likely that we'd even have a reason to reply without data).

>> +[option #A3 - enabling structured replies MUST affect all commands]
>> +The simple reply message MUST be sent by the server in response to all
>> +requests if the experimental `STRUCTURED_REPLY` extension was not
>> +negotiated,
> 
> "not negotiated" should read "negotiated"!

D'oh. Blame late-night editing.  But since it's looking like we don't
want option #3, I'm not too worried about rewording this one.

> 
> I can't help thinking that if the structured reply thing is an
> experimental extension, the wording on when it should be used would
> be better down with where it describes the extension.

Again, it's a tricky balance of how much to describe up front and how
much to put in the experimental section, especially since it all moves
up front after we implement it, and I have a secondary goal of trying to
make that move easier to perform.

> 
>> * `NBD_CMD_READ` (0)
>>
>>     A read request. Length and offset define the data to be read. The
>> -    server MUST reply with a reply header, followed immediately by len
>> -    bytes of data, read offset bytes into the file, unless an error
>> -    condition has occurred.
>> +    server MUST reply with a simple reply header, followed immediately
>> +    by len bytes of data, read from offset bytes into the file, unless
>> +    an error condition has occurred.
> 
> I think: "The server MUST reply with either a simple reply or a
> structured reply; which is permissible is defined elsewhere in
> this document. In the case of a simple reply, the reply consists
> of a simple reply header ..."
> 
> (structured replies are defined elsewhere)

That seems like a good approach.

>> @@ -500,7 +540,7 @@ The following request types exist:
>> * `NBD_CMD_FLUSH` (3)
>>
>>     A flush request; a write barrier. The server MUST NOT send a
>> -    successful reply header for this request before all write requests
>> +    successful simple reply header for this request before all write requests
> 
> This depends whether we choose #A1, #A2, or #A3. I think it's better
> to just refer to "reply" here.

Sure.  Stating 'simple reply' was evidence that I was leaning towards
#A1; but now that I'm leaning towards #A2, leaving it at just 'reply' is
indeed better.


>> +    [option #B2 - final transmission flags are accurate, but
>> +    intermediate transmission flags can anticipate negotiation; state
>> +    change can be observed if negotiation does not happen]
>> +    When responding to the `NBD_OPT_EXPORT_NAME` option request (or
>> +    the `NBD_OPT_SELECT` request of the experimental `SELECT`
>> +    extension), the server MUST set this transmission flag to 1 if
>> +    structured replies have been negotiated, and MUST NOT set this
>> +    flag otherwise; that way, the client MAY reliably use the final
>> +    state of this flag as a reliable witness of whether to expect a
>> +    simple reply or structured reply to the `NBD_CMD_READ`
>> +    transmission request.  When responding to the `NBD_OPT_LIST`
>> +    option request, the server MAY set this transmission flag, even if
>> +    structured replies have not yet been negotiated.
> 
> This is an export flag, yes? Save in oldstyle negotiation (which surely

Now called 'transmission flag', but yes.

> no one cares about as structured replies can't have been negotiated) the
> export flags are sent after option haggling, in which case we always
> know whether to set this - we set it to 1 if structured replies have
> been negotiated, or to 0 otherwise. That's all this needs to say.
> 
> Therefore the difference in text merely comes down to its additional
> use in the SELECT extension. That can be covered by a simple section
> in the SELECT extension saying "The value of NBD_FLAG_SEND_DF in
> the select extension cannot be relied upon by the Client and MUST
> be set to zero by the server. This is because at the stage the
> SELECT extension is negotiated, structured replies may or may not
> have been negotiated".

Okay, that works for me.


> 
> Could someone recap why we need this flag? We know when DF can be
> sent, it's exactly when structured replies have been negotiated.
> I think Wouter suggested there should be a 1:1 relationship between
> SEND_ flags and CMD_ flags for convenience of user space. But that
> seems odd as there isn't at the moment. For instance SEND_FUA is
> on bit 3, CMD_FUA is on bit 0 or (cough) 16. None of the others
> have CMD flag equivalents as far as I can see.
> 
> My vote would be delete the flag (again)!

Wouter proposed a pretty compelling use case: when the client is split
between user/kernel, then the flow is:

user->kernel: ioctl(): are structured replies supported? if so:
user->server: NBD_OPT_STRUCTURED_REPLY
user->server: OPT_EXPORT_NAME/OPT_SELECT-OPT_GO
server->user: transmission flags reflect if structured replies are on
user->kernel: ioctl(): inform about transmission flags
user->kernel: ioctl(): phase change
kernel->server: send CMD_READ, reported flag says which response to expect

Yeah, the user space already has to ask the kernel if the kernel is
willing to receive structured replies before sending
NBD_OPT_STRUCTURED_REPLY to the server, but making the server echo back
the negotiation in the transmission flags saves the userspace from
needing an additional ioctl() to inform the kernel about whether the
negotiation with the server was successful.

> 
>> +    Unless explicitly documented for a given request, a structured
>> +    reply MUST occupy only one message (similar to a simple reply).
> 
> I think that's unnecessary. If you are going to allow structured
> replies, I see no reason why you can't reply with an error chunk,
> then a final 'none' chunk with the 'end' bit set. Apologies if I
> made this point before and it was rejected.

Okay, I can reword things to ALWAYS allow multiple chunks.  It is less
efficient on bandwidth, but may be simpler to code a server that way.

> 
>> +    However, some requests document that a structured reply MAY occupy
>> +    multiple chunks; each chunk uses a structured reply message (all
>> +    with the same value for "handle"), and the `NBD_REPLY_FLAG_DONE`
>> +    reply flag is used to identify the final chunk.  Where multiple
>> +    chunks are permitted, the intermediate chunks MAY be reordered
>> +    within constraints documented by the request, and the chunks MAY
>> +    be interleaved with messages from other pending transactions; but
>> +    the final chunk MUST always end the reply.
> 
> The above change would simplify this.
> 
>> +    - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
>> +      more structured reply chunks will be sent for the same client
>> +      request, and MUST set this bit if this is the final reply.  This
>> +      flag must always be set in response to requests which are
>> +      documented as using a structured reply, but not documented as
>> +      permitting multiple chunks.
> 
> Ditto

Indeed.

> 
>> +    The server MUST NOT set any other flags without first negotiating
>> +    the extension with the client.  Clients that receive an
>> +    unrecognized flag SHOULD close the connection.
> 
> An alternative formulation would be that:
> 
> The server MUST NOT set any other flags without negotiating the
> extension with the client, unless the client can perform the
> action requested without interpreting the flag (for instance if
> the flag is some form of hint). Clients MUST ignore unrecognised
> flags.
> 
> This leaves it open for the server to negotiate additional flags
> but means simple flags that are hints do not require negotiation
> changes.

Nice.  I'll fold that in.

> 
>> +    - `NBD_REPLY_TYPE_ERROR` (1)
>> +
>> +      This reply type represents an error chunk.  *length* MUST be
>> +      exactly 4.  The payload is structured as:
>> +
>> +      32 bits: error (MUST be nonzero)  
>> +
>> +      This reply represents that an error occurred, and the client MAY
>> +      NOT make any assumptions about partial success.
> 
> "This reply represents the occurrence of an error. The client MUST NOT
> make any assumptions as to partial success and MUST treat all data
> received (if any) as potentially invalid."
> 
>> This type SHOULD
>> +      NOT be used unless it is the final reply chunk (where the flag
>> +      `NBD_REPLY_FLAG_DONE` is set), or if it is immediately followed
>> +      by a chunk with type `NBD_REPLY_TYPE_NONE`.
> 
> I know this is only a 'SHOULD' but I don't see any particular reason for
> this distinction. Going back to the Ceph example, it sets of 10
> reads in parallel, and gets an error. Why not simply send an error
> packet, then when everything is done, send a 'NONE' with 'DONE' set?
> Sounds reasonable to me.
> 
> You don't specifically prohibit more than one NBD_REPLY_TYPE_ERROR which
> is probably a good thing, but if you intended to, you should document
> it.

Okay, at this point, I'm leaning towards 'this chunk type SHOULD NOT be
sent more than once in a structured reply' (SHOULD NOT, rather than MUST
NOT, because there may still be lazy implementations that uses this type
of error after each chunk with an error rather than reporting offsets,
but good implementations will not do it), and drop the bit about it
needing to be last or next to last.


>> +    - `NBD_REPLY_TYPE_ERROR_OFFSET` (2)
>> +
>> +      This reply type represents an error chunk.  *length* MUST be
>> +      exactly 12.  The payload is structured as:
>> +
>> +      32 bits: error (MUST be nonzero)  
>> +      64 bits: offset (unsigned)  
>> +
>> +      In addition to declaring that an error occurred, this type
>> +      provides enough additional information to inform the client
>> +      about any partial success.  *offset* MUST lie within the bounds
>> +      of the original offset and length of the client's request.  If
>> +      *offset* also lies within the bounds of an earlier data chunk of
>> +      the same reply, then the client MAY assume that data within that
>> +      earlier chunk is valid (while the rest of that chunk MAY be
>> +      bogus).  Any later data chunks of the same reply MUST NOT
>> +      contain the offset of this chunk.
>> +
>> +      Valid as a reply to `NBD_CMD_READ`.
> 
> If we are allowing structured replies to other commands, I think this
> would be a useful reply to NBD_CMD_WRITE (and for that matter TRIM).
> This is actually quite a good justification for allowing structured
> replies to other commands.
> 

Yes, that, and the idea of UTF-8 human-readable errors, are the reasons
I'm now leaning towards option #2.

> We should probably explicitly say that more than one of these is
> permitted.

Indeed.

> 
>> +
>> +    - `NBD_REPLY_TYPE_OFFSET_DATA` (3)
>> +
>> +      This reply type represents a data chunk.  *length* MUST be at
>> +      least 9.  The payload is structured as:
>> +
>> +      64 bits: offset (unsigned)  
>> +      *length - 8* bytes: data  
>> +
>> +      This reply represents the contents of *length - 8* bytes of the
>> +      file, starting at *offset*.  The data MUST lie within the bounds
>> +      of the original offset and length of the client's request, and
>> +      MUST NOT overlap with any earlier data or error chunks of the
>> +      same reply.
> 
> Nope. It can overlap with error chunks (see the sendfile example). It
> can't overlap with holes.

No, I meant the wording I used.  If you send an error chunk, you MUST
NOT then send a data chunk that repeats the offset mentioned in the
earlier error chunk.  An error chunk may overlap an earlier data chunk,
but not the other way around.

> 
> "and MUST NOT overlap with any earlier data or hole chunks of the
> same reply"

And here, I was TRYING to be clever by using "data chunk" to mean
"either TYPE_OFFSET_DATA or TYPE_OFFSET_HOLE".  But that overloads the
term "data chunk", so maybe I can come up with a better term?

> 
>> +
>> +      Valid as a reply to `NBD_CMD_READ`.
>> +
>> +    - `NBD_REPLY_TYPE_OFFSET_HOLE` (4)
>> +
>> +      This reply type represents a data chunk.  *length* MUST be
>> +      exactly 12.  The payload is structured as:
>> +
>> +      64 bits: offset (unsigned)  
>> +      32 bits: hole size (unsigned)  
>> +
>> +      This reply represents that *hole size* bytes of the file (which
>> +      MUST be non-zero), starting at *offset*, read as all zeroes.
>> +      The hole MUST lie within the bounds of the original offset and
>> +      length of the client's request, and MUST NOT overlap with any
>> +      earlier data or error chunks of the same reply.
> 
> Again, it can overlap with error chunks (see the sendfile example). It
> can't overlap with holes.
> 
> "and MUST NOT overlap with any earlier data or hole chunks of the
> same reply"

Again, I was trying to lump data and hole types both as "data chunks".
Maybe "content chunks" is a good unambiguous term?

>> +* `EOVERFLOW`
>> +
>> +    The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
>> +    client has requested `NBD_CMD_FLAG_DF` for a length that is too
>> +    large to read without fragmentation.  The server MUST NOT return
>> +    this error if the read request did not exceed 65,536 bytes, and
>> +    SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.
> 
> Although I'm implementing a server at the moment where I'd like to
> do just that with lengths in excess of 2^31.

Oh, good point - particularly true if you are targeting 32-bit hardware
where ssize_t limits read() and write() to 2^31 bytes in a transaction
(and in that case, both CMD_READ and CMD_WRITE should consider this case).

But what do we do about Wouter's request that we not permit EOVERFLOW
without structured replies being negotiated?  Does that mean that
CMD_WRITE cannot reply with EOVERFLOW unless structured replies are
negotiated?

>> +    If structured replies are negotiated, then a read request MUST
>> +    result in a structured reply that MAY contain one or more chunks
> 
> MUST contain one or more chunks. Ot just "that contains one or more
> chunks". It always has at least one chunk even if that's a NONE chunk - surely?

Rebase flaw, will fix.

> Or in your terminology is the NONE thing not a chunk? In my terminology
> it's a chunk, but not a data chunk. This is also consistent with your
> final para:

I _want_ to call the NONE type a chunk, but it is neither a 'data chunk'
('content chunk'?) nor an 'error chunk', and is a special type in that
if it is used in a structured reply, it must always be the final chunk.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Attachment: signature.asc
Description: OpenPGP digital signature
Reply to:
Follow-Ups:
- Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
  - From: Alex Bligh <alex@...872...>
References:
- [Nbd] [PATCH v3 0/5] Structured replies
  - From: Eric Blake <eblake@...696...>
- [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
  - From: Eric Blake <eblake@...696...>
- Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
  - From: Alex Bligh <alex@...872...>
Prev by Date: Re: [Nbd] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension
Next by Date: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
Previous by thread: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
Next by thread: Re: [Nbd] [PATCH v3 4/5] doc: Propose STRUCTURED_REPLY extension
Index(es):
- Date
- Thread