Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size

To: Eric Blake <eblake@...696...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>
Subject: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
From: Alex Bligh <alex@...872...>
Date: Wed, 13 Apr 2016 12:04:55 +0100
Message-id: <0BD53B45-59ED-47D5-A58D-22E6699FCC24@...872...>
In-reply-to: <1460506614-31604-1-git-send-email-eblake@...696...>
References: <1460506614-31604-1-git-send-email-eblake@...696...>
Eric,

On 13 Apr 2016, at 01:16, Eric Blake <eblake@...696...> wrote:

> Existing NBD servers often have limitations, such as requiring
> actions to be aligned to block sizes or limiting maximum
> transactions to avoid denial of service attacks; for example,
> qemu's NBD server refuses any transaction larger than 32M.  But
> to date, clients have to learn these limitations via out-of-band
> means.
> 
> This alters NBD_OPT_INFO and NBD_OPT_GO to use a new reply type
> NBD_REP_INFO (rather than overloading NBD_REP_SERVER), so that
> we have a future-proof way of supplying as much additional
> structured information about an export as we want.

... technically we could send NBD_OPT_INFO and a final
NBD_OPT_SERVER.

> Design decision: a client that wants to learn block sizes MUST
> use NBD_OPT_GO, rather than the old NBD_OPT_EXPORT_NAME, even
> though we could have repurposed some of the reserved zeroes when
> NBD_FLAG_C_NO_ZEROES is not in effect, because we don't want to
> encourage any further abuse of NBD_OPT_EXPORT_NAME.

Agree with the premise.

Is there anything to stop a server using NBD_OPT_INFO, then
NBD_OPT_EXPORTNAME (given the guarantee about things not changing
between them)? I agree that a server would be well advised for
every reason under the sun to use NBD_OPT_GO instead if it\s
already using NBD_OPT_INFO.


> Design decision: no new global NBD_FLAG or NBD_OPT are required;
> there is nothing for the client to negotiate.  The server merely
> provides as much information as it can, and the client then
> interprets what information it understands.  The items are
> structured so that a client can ignore details from the server
> that the client does not know about, and that we can easily add
> future items of information.

+1

> +## Block sizes
> +
> +During transmission phase, several operations are constrained by three
> +block sizes: minimum, preferred, and maximum; as well as the export
> +size sent in reply to `NBD_OPT_EXPORT_NAME` or `NBD_OPT_GO`.  A server
> +MAY advertise the three block size contraints during handshake phase
> +via the experimental `INFO` extension; see below.  However, not all
> +servers advertise block size information; if such information is not
> +also made available via out of band means, clients SHOULD assume some
> +default sizes

s/some default sizes/the default sizes set out in this document/

> for the maximum chance of successful interaction, as
> +described herein.
> +
> +The minimum block size represents the smallest addressable length and
> +alignment within the export, although writing to an area that small
> +may require the server to use a less-efficient read-modify-write
> +action.  If advertised, this value MUST be a power of 2, MUST NOT be
> +larger than 65,536, and MAY be as small as 1 for an export backed by a
> +regular file, although the values of 512 or 4,096 are more typical for
> +an export backed by a block device.  If a server advertises a minimum
> +block size, the advertised export size MUST be an integer multiple of
> +that block size.  If the server does not advertise a minimum block
> +size, the client MAY infer a minimum block size as the smaller of 512
> +bytes or the first power of 2 that evenly divides the export size.

I think better say "the client MUST infer the default minimum block
size". As Wouter says, the correct default minimum block size is 1,
because that's what the current spec says. If a server currently
can't handle a minimum block size of 1, it's currently non-compliant
and we should be encouraging it to adopt this extension (or fix it)
rather than bodge the spec around it.

> +The preferred block size represents the minimum size at which aligned
> +requests will have efficient I/O, avoiding behaviour such as
> +read-modify-write.  If advertised, this MUST be a power of 2 at least
> +as large as the minimum block size, and MUST NOT be larger than
> +3,355,442 (32M).

I'm not sure of the origin of the 32M limit or its purpose, but I
think it's harmless. The reason I'm not sure of its purpose is
because we already constrain things to be smaller than the maximum
block size.

I can think of systems with sharded disks where it is conceivable
they would prefer a larger preferred block size. If this is a problem
for the client, they can always limit it clientside.

>  The preferred block size MAY be larger than the
> +export size, in which case the client is unable to utilize the
> +preferred block size for that export.  If the server does not
> +advertise a preferred block size, the client MAY assume that a
> +preferred block size of 65,536 bytes will have reasonable performance.

Better, "If the server does not advertise a preferred block size,
the default preferred block size of 65,536 bytes will apply."

Somewhere else: "If the server does not advertise one or more
block sizes, and the block sizes explicitly advertised (if any)
MUST be compatible with the default block sizes used instead of
those not advertised."

> +
> +The maximum block size represents the maximum length that the server
> +is willing to handle in one request.  If advertised, it MUST be either
> +an integer multiple of the minimum block size or the value
> +4,294,967,295

Probably worth spelling out this is (uint32_t)(-1)

> for no inherent limit, MUST be at least as large as the
> +preferred block size, and SHOULD be at least 3,355,442 (32M), but MAY
> +be something other than a power of 2.  For convenience, the server MAY
> +advertise a maximum block size that is larger than the export size,
> +although in that case, the client MUST treat the export size as the
> +effectual maximum block size.

s/effectual/effective/

Shouldn't the effective maximum block size actually be 'export length
- offset'?

>  If the server does not advertise a
> +maximum block size, the client SHOULD NOT send requests with a length
> +larger than 3,355,442 (32M), to minimize the chance of the server
> +rejecting the request as too large.

Again, I'd say "a default maximum block size of X shall apply". And
I'm afraid I think (per Wouter) the number is -1, i.e. no maximum,
because that's what the current spec is. Even though plenty of us
have servers that violate this (me included).
> 
> +Where a transmission request can have a nonzero

s/nonzero/non-zero/ ?

> *offset* and/or
> +*length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
> +the client MUST ensure that *offset* and *length* are integer
> +multiples of any advertised minimum block size, and SHOULD use integer
> +multiples of any advertised preferred block size where possible.  For
> +those requests, the client MUST NOT use a *length* of zero, and MUST
> +NOT use a *length* larger than any advertised maximum block size or
> +which, when added to *offset*, would exceed the export size.  The
> +server SHOULD

MAY? I see no particular harm in the server handling it. Especially if
(e.g. on TRIM) it handles it through a NOP.

> report an `EINVAL` error if the client's request is not
> +aligned to advertised minimum block size boundaries, or is larger than
> +the advertised maximum block size, although the server MAY instead
> +disconnect if a large *length* could be deemed as a denial of service
> +attack.
> +
> ## Values
> 
> This section describes the value and meaning of constants (other than
> @@ -753,7 +814,7 @@ during option haggling in the fixed newstyle negotiation.
> 
> * `NBD_REP_SERVER` (2)
> 
> -    A description of an export. Data:
> +    A simple description of an export. Data:
> 
>     - 32 bits, length of name (unsigned); MUST be no larger than the
>       reply packet header length - 4
> @@ -766,9 +827,9 @@ during option haggling in the fixed newstyle negotiation.
>       particular client request, this field is defined to be a string
>       suitable for direct display to a human being.
> 
> -    The experimental `INFO` extension (see below) adds two client
> -    option requests where the extra data has a definition other than a
> -    text string.
> +* `NBD_REP_INFO` (3)
> +
> +    Defined by the experimental `INFO` extension; see below.
> 
> There are a number of error reply types, all of which are denoted by
> having bit 31 set. All error replies MAY have some data set, in which
> @@ -962,11 +1023,13 @@ The following error values are defined:
>   experimental `STRUCTURED_REPLY` extension; see below.
> 
> The server SHOULD return `ENOSPC` if it receives a write request
> -including one or more sectors beyond the size of the device.  It SHOULD
> -return `EINVAL` if it receives a read or trim request including one or
> -more sectors beyond the size of the device.  It also SHOULD map the
> -`EDQUOT` and `EFBIG` errors to `ENOSPC`.  Finally, it SHOULD return
> -`EPERM` if it receives a write or trim request on a read-only export.
> +including one or more sectors beyond the size of the device.

TBH I think `EINVAL` for all offset/length errors would be more appropriate.

>  It
> +SHOULD return `EINVAL` if it receives a read or trim request including
> +one or more sectors beyond the size of the device, or if a read or
> +write request is not aligned to advertised minimum block sizes. It
> +also SHOULD map the `EDQUOT` and `EFBIG` errors to `ENOSPC`.

These error codes seem very linux / implementation specific. Also
I find a read returning an ENOSPC error a bit weird. lseek, readv
etc. all return EINVAL in such circumstances. ENOSPC is normally
for operations which would take up more space.

>  Finally,
> +it SHOULD return `EPERM` if it receives a write or trim request on a
> +read-only export.

s/SHOULD/MUST/?

> 
> The server SHOULD return `EINVAL` if it receives an unknown command.

hmmm, then again, EINVAL seems overloaded.

> 
> @@ -1000,12 +1063,11 @@ any structured message). This is a result of a (misguided) attempt to
> keep backwards compatibility with non-fixed newstyle negotiation.
> 
> To remedy this, an `INFO` extension is envisioned. This extension adds
> -two option requests and one error reply type, and extends one existing
> -option reply type.
> +two option requests, one option reply type, and one error reply type.
> 
> Both options have identical formats for requests and replies. The
> only difference is that after a successful reply to `NBD_OPT_GO`
> -(i.e. an `NBD_REP_SERVER`), transmission mode is entered immediately.
> +(i.e. an `NBD_REP_INFO`), transmission mode is entered immediately.
> Therefore these commands share common documentation.
> 
> * `NBD_OPT_INFO` and `NBD_OPT_GO`
> @@ -1038,7 +1100,7 @@ Therefore these commands share common documentation.
>       server.
>     - `NBD_REP_ERR_TLS_REQD`: The server does not wish to export this
>       block device unless the client initiates TLS first.
> -    - `NBD_REP_SERVER`: The server accepts the chosen export.
> +    - `NBD_REP_INFO`: The server accepts the chosen export.
> 
>     Additionally, if TLS has not been initiated, the server MAY reply
>     with `NBD_REP_ERR_TLS_REQD` (instead of `NBD_REP_ERR_UNKNOWN`)
> @@ -1048,22 +1110,8 @@ Therefore these commands share common documentation.
>     For backwards compatibility, clients SHOULD be prepared to also
>     handle `NBD_REP_ERR_UNSUP` by falling back to using `NBD_OPT_EXPORT_NAME`.
> 
> -    In the case of `NBD_REP_SERVER`, the message's data takes on a different
> -    interpretation than the default (so as to provide additional
> -    binary information normally sent in reply to `NBD_OPT_EXPORT_NAME`,
> -    in place of the default free-form string). The option reply length
> -    MUST be *length of name* + 14, and the option data has the following layout:
> -
> -    - 32 bits, length of name (unsigned)  
> -    - String: name of the export. This name MAY be different from the one
> -      given in the `NBD_OPT_INFO` or `NBD_OPT_GO` option in case the
> -      server has multiple alternate names for a single export, or a
> -      default export was specified.  
> -    - 64 bits, size of the export in bytes (unsigned)  
> -    - 16 bits, transmission flags.  
> -
>     If there are no intervening option requests between a successful
> -    `NBD_OPT_INFO` (that is, one that replied with `NBD_REP_SERVER`)
> +    `NBD_OPT_INFO` (that is, one that replied with `NBD_REP_INFO`)
>     and an `NBD_OPT_GO` with the same parameters, then the server MUST
>     use an identical success response, including transmission flags.
>     Otherwise, due to the intervening option requests or the use of
> @@ -1071,16 +1119,80 @@ Therefore these commands share common documentation.
>     flags, and/or MAY fail the second request.
> 
>     The reply to an `NBD_OPT_GO` is identical to the reply to `NBD_OPT_INFO`
> -    save that if the reply indicates success (i.e. is `NBD_REP_SERVER`),
> +    save that if the reply indicates success (i.e. is `NBD_REP_INFO`),
>     the client and the server both immediately enter the transmission
>     phase. The server MUST NOT send any zero padding bytes after the
> -    `NBD_REP_SERVER` data, whether or not the client negotiated the
> +    `NBD_REP_INFO` data, whether or not the client negotiated the
>     `NBD_FLAG_C_NO_ZEROES` flag. After sending this reply the server MUST
>     immediately move to the transmission phase, and after receiving this
>     reply, the client MUST immediately move to the transmission phase;
>     therefore, the server MUST NOT send this particular reply until all
>     other pending option replies have been sent by the server.
> 
> +* `NBD_REP_INFO`
> +
> +    A detailed description of an export, subdivided into one or more
> +    items of information, used by `NBD_OPT_INFO` and `NBD_OPT_GO` on
> +    success, and which provides a superset of the information given in
> +    reply to `NBD_OPT_EXPORT_NAME`.  The header *length* MUST be at
> +    least 14, and MUST match the length visited by reading items in
> +    turn (that is, 4 * number of items + sum of *info length* from
> +    each item).

It would be easier to program this if there wasn't a total length,
just individual items, and the last one was a NBD_INFO_DONE sentinel,
or perhaps had a bit set on its type. Otherwise you need to find the
length of ALL your strings etc. in advance, which is painful. A
client is going to want to skip these one by one anyway (as it
needs to interpret them to see whether it understands each one)
so a length at the start gives no advantage.

>  The data layout is a sequence of items, each with the
> +    layout:
> +
> +    - 16 bits, info type  
> +    - 16 bits, info length  
> +    - *info length* bytes  
> +
> +    The info type `NBD_INFO_SIZE_FLAG` is mandatory, and MUST appear
> +    as the last item; all other info types are optional, and MUST NOT
> +    appear more than once.  A client MUST ignore unknown info types.
> +
> +    The following info types are defined:
> +
> +    * `NBD_INFO_SIZE_FLAG` (0)
> +
> +      The *info length* MUST be 10, and represents the same
> +      information as given in reply to `NBD_OPT_EXPORT_NAME` as if
> +      `NBD_FLAG_C_NO_ZEROES` were negotiated.
> +
> +      - 64 bits, size of the export in bytes (unsigned)  
> +      - 16 bits, transmission flags

I vote for breaking these up into two separate options. In the
current nbd-server implementation, finding the export length
is "hard", and finding the transmission flags easy(er).
There seems no reason to keep them together.

Should sending transmission flags be compulsory?

Should sending export size be compulsory?

> +    * `NBD_INFO_NAME` (1)
> +
> +      Represents the server's canonical name of the export. The name
> +      MAY differ from the name presented in the client's option
> +      request, and the info item SHOULD be omitted unless the server
> +      has multiple alternate names for a single export or a default
> +      export was specified.
> +
> +      - String: name of the export, with *info length* bytes  
> +
> +    * `NBD_INFO_MINIMUM_BLOCK` (2)
> +
> +      The *info length* MUST be 4, and represents the minimum block
> +      size.  See the "Block sizes" section for further requirements on
> +      its value.
> +
> +      - 32 bits, minimum block size  
> +
> +    * `NBD_INFO_PREFERRED_BLOCK` (3)
> +
> +      The *info length* MUST be 4, and represents the preferred block
> +      size.  See the "Block sizes" section for further requirements on
> +      its value.
> +
> +      - 32 bits, preferred block size  
> +
> +    * `NBD_INFO_MAXIMUM_BLOCK` (4)
> +
> +      The *info length* MUST be 4, and represents the maximum block
> +      size.  See the "Block sizes" section for further requirements on
> +      its value.
> +
> +      - 32 bits, maximum block size  

I like these.

> ### `WRITE_ZEROES` extension
> 
> There exist some cases when a client knows that the data it is going to write
> @@ -1291,13 +1403,15 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>       be at least 14.  This reply represents that an error occurred at
>       a given offset, which MUST lie within the original offset and
>       length of the request; the client can use this offset to
> -      determine if request had any partial success.  This chunk type
> -      MAY appear multiple times in a structured reply, although the
> -      same offset SHOULD NOT be repeated.  Likewise, if content chunks
> -      were sent earlier in the structured reply, the server SHOULD NOT
> -      send multiple distinct offsets that lie within the bounds of a
> -      single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> -      `NBD_CMD_WRITE`, `NBD_CMD_WRITE_ZEROES`, and `NBD_CMD_TRIM`.
> +      determine if request had any partial success.  The server MAY
> +      use an offset that is not a multiple of any advertised minimum
> +      block size.  This chunk type MAY appear multiple times in a
> +      structured reply, although the same offset SHOULD NOT be
> +      repeated.  Likewise, if content chunks were sent earlier in the
> +      structured reply, the server SHOULD NOT send multiple distinct
> +      offsets that lie within the bounds of a single content chunk.
> +      Valid as a reply to `NBD_CMD_READ`, `NBD_CMD_WRITE`,
> +      `NBD_CMD_WRITE_ZEROES`, and `NBD_CMD_TRIM`.
> 
>       The payload is structured as:
> 
> @@ -1357,8 +1471,9 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>     The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
>     client has requested `NBD_CMD_FLAG_DF` for a length that is too
>     large to read without fragmentation.  The server MUST NOT return
> -    this error if the read request did not exceed 65,536 bytes, and
> -    SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.
> +    this error if the read request did not exceed the larger of 65,536
> +    bytes or the advertised preferred block size,

ok

> and SHOULD NOT
> +    return this error if `NBD_CMD_FLAG_DF` is not set.
> 
> * `NBD_CMD_READ`
> 
> @@ -1381,12 +1496,17 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>     overhead, the server SHOULD use chunks with lengths and offsets as
>     an integer multiple of 512 bytes, where possible (the first and
>     last chunk of an unaligned read being the most obvious places for
> -    an exception).  The server MUST NOT send content chunks that
> -    overlap with any earlier content or error chunk, and MUST NOT send
> -    chunks that describe data outside the offset and length of the
> -    request, but MAY send the content chunks in any order (the client
> -    MUST reassemble content chunks into the correct order), and MAY
> -    send additional content chunks even after reporting an error chunk.
> +    an exception).  If the server advertised block sizes,

Again I'd go with the view that there is always a {minimum,maximum,preferred}
block size, and if they are not advertised, they are just the default values.
So this should always apply. But the default minimum block size would be 1.

> it MUST
> +    ensure that every chunk has a length and offset which are an
> +    integer multiple of the minimum block size, and SHOULD use chunks
> +    with multiples of the preferred block size where possible.

s/multiples/integer multiples/

> +    The server MUST NOT send content chunks that overlap with any
> +    earlier content or error chunk, and MUST NOT send chunks that
> +    describe data outside the offset and length of the request, but
> +    MAY send the content chunks in any order (the client MUST
> +    reassemble content chunks into the correct order), and MAY send
> +    additional content chunks even after reporting an error chunk.
>     Note that a request for more than 2^32 - 8 bytes MUST be split
>     into at least two chunks, so as not to overflow the length field
>     of a reply while still allowing space for the offset of each
> @@ -1441,7 +1561,8 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>     if the length is too large to send without fragmentation, in which
>     case it MUST NOT send a content chunk; however, the server MUST
>     support unfragmented reads in which the client's request length
> -    does not exceed 65,536 bytes.
> +    does not exceed the larger of 65,536 bytes or the advertised
> +    preferred block size.

ok.

Other than that, looking good in principle.

-- 
Alex Bligh
Reply to:
Follow-Ups:
- Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
  - From: Eric Blake <eblake@...696...>
References:
- [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
  - From: Eric Blake <eblake@...696...>
Prev by Date: Re: [Nbd] [PATCH] Docs: improve description of disconnection methods
Next by Date: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Previous by thread: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Next by thread: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Index(es):
- Date
- Thread