Re: [Nbd] [PATCH v5 2/2] doc: Add details on block sizes
- To: Eric Blake <eblake@...696...>
- Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>
- Subject: Re: [Nbd] [PATCH v5 2/2] doc: Add details on block sizes
- From: Alex Bligh <alex@...872...>
- Date: Sun, 17 Apr 2016 11:46:36 +0100
- Message-id: <CF5BE3AC-3A44-4C3C-9C06-CBE6053CAEEF@...872...>
- In-reply-to: <1460842265-21385-3-git-send-email-eblake@...696...>
- References: <1460842265-21385-1-git-send-email-eblake@...696...> <1460842265-21385-3-git-send-email-eblake@...696...>
Eric,
Applied.
And I'm now going to move them out to a separate branch.
Alex
On 16 Apr 2016, at 22:31, Eric Blake <eblake@...696...> wrote:
> Existing NBD servers often have limitations, such as requiring
> actions to be aligned to block sizes or limiting maximum
> transactions to avoid denial of service attacks; for example,
> qemu's NBD server refuses any transaction larger than 32M. But
> to date, clients have to learn these limitations via out-of-band
> means, and nothing in the spec allowed for alignment limitations.
>
> Add a section to the document describing overall block size
> constraints, and rules for what defaults to use if there is no
> communication (whether out of band, or by the new options added
> here).
>
> Also, add a new client option NBD_OPT_BLOCK_SIZE (a promise that
> the client will obey any advertised block sizes, to let a server
> optimize to use O_DIRECT without worrying about how it would have
> to report errors), and extend NBD_REP_INFO (to allow the server
> to advertise block sizes in band, for a new enough client that
> uses NBD_OPT_GO).
>
> Design decision: a client that wants to learn block sizes MUST
> use NBD_OPT_GO, rather than the old NBD_OPT_EXPORT_NAME, even
> though we could have repurposed some of the reserved zeroes when
> NBD_FLAG_C_NO_ZEROES is not in effect, because we don't want to
> encourage any further abuse of NBD_OPT_EXPORT_NAME.
>
> Signed-off-by: Eric Blake <eblake@...696...>
> ---
> doc/proto.md | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 165 insertions(+), 22 deletions(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index 402a6be..4958348 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -674,6 +674,87 @@ This functionality has not yet been implemented by the reference
> implementation, but was implemented by qemu and subsequently
> by other users, so has been moved out of the "experimental" section.
>
> +## Block sizes
> +
> +During transmission phase, several operations are constrained by the
> +export size sent by the final `NBD_OPT_EXPORT_NAME` or `NBD_OPT_GO`,
> +as well as by three block sizes defined here (minimum, preferred, and
> +maximum). If a client can honor server block sizes (as set out in the
> +experimental `BLOCK_SIZE` extension below), it SHOULD announce this
> +during the handshake phase, and SHOULD use `NBD_OPT_GO` rather than
> +`NBD_OPT_EXPORT_NAME`. A server SHOULD advertise the block size
> +constraints during handshake phase via the experimental `INFO`
> +extension; see below. A server and client MAY agree on block sizes
> +via out of band means.
> +
> +If block sizes have not been advertised or agreed on externally, then
> +a client SHOULD assume a default minimum block size of 1, a preferred
> +block size of 2^12 (4,096), and a maximum block size of the smaller of
> +the export size or 0xffffffff (effectively unlimited). A server that
> +wants to enforce block sizes other than the defaults specified here
> +MUST support the experimental `INFO` extension, and MAY refuse to go
> +into transmission phase with a client that uses `NBD_OPT_EXPORT_NAME`
> +or failed to use `NBD_OPT_BLOCK_SIZE`, although a server SHOULD permit
> +such clients if block sizes can be agreed on externally. When
> +allowing such clients, the server MUST cleanly error commands that
> +fall outside block size parameters without corrupting data; even so,
> +this may limit interoperability.
> +
> +A client MAY choose to operate as if tighter block sizes had been
> +specified (for example, even when the server advertises the default
> +minimum block size of 1, a client may safely use a minimum block size
> +of 2^9 (512), a preferred block size of 2^16 (65,536), and a maximum
> +block size of 2^25 (33,554,432)). Notwithstanding any maximum block
> +size advertised, either the server or the client MAY initiate a hard
> +disconnect if the size of a request or a reply is large enough to be
> +deemed a denial of service attack.
> +
> +The minimum block size represents the smallest addressable length and
> +alignment within the export, although writing to an area that small
> +may require the server to use a less-efficient read-modify-write
> +action. If advertised, this value MUST be a power of 2, MUST NOT be
> +larger than 2^16 (65,536), and MAY be as small as 1 for an export
> +backed by a regular file, although the values of 2^9 (512) or 2^12
> +(4,096) are more typical for an export backed by a block device. If a
> +server advertises a minimum block size, the advertised export size
> +SHOULD be an integer multiple of that block size, since otherwise, the
> +client would be unable to access the final few bytes of the export.
> +
> +The preferred block size represents the minimum size at which aligned
> +requests will have efficient I/O, avoiding behaviour such as
> +read-modify-write. If advertised, this MUST be a power of 2 at least
> +as large as the smaller of the minimum block size and 2^12 (4,096),
> +although larger values (such as the minimum granularity of a hole) are
> +also appropriate. The preferred block size MAY be larger than the
> +export size, in which case the client is unable to utilize the
> +preferred block size for that export. The server MAY advertise an
> +export size that is not an integer multiple of the preferred block
> +size.
> +
> +The maximum block size represents the maximum length that the server
> +is willing to handle in one request. If advertised, it MUST be either
> +an integer multiple of the minimum block size or the value 0xffffffff
> +for no inherent limit, MUST be at least as large as the smaller of the
> +preferred block size or export size, and SHOULD be at least 2^25
> +(33,554,432) if the export is that large, but MAY be something other
> +than a power of 2. For convenience, the server MAY advertise a
> +maximum block size that is larger than the export size, although in
> +that case, the client MUST treat the export size as the effective
> +maximum block size (as further constrained by a non-zero offset).
> +
> +Where a transmission request can have a non-zero *offset* and/or
> +*length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
> +the client MUST ensure that *offset* and *length* are integer
> +multiples of any advertised minimum block size, and SHOULD use integer
> +multiples of any advertised preferred block size where possible. For
> +those requests, the client MUST NOT use a *length* larger than any
> +advertised maximum block size or which, when added to *offset*, would
> +exceed the export size. The server SHOULD report an `EINVAL` error if
> +the client's request is not aligned to advertised minimum block size
> +boundaries, or is larger than the advertised maximum block size,
> +although the server MAY instead initiate a hard disconnect if a large
> +*length* could be deemed as a denial of service attack.
> +
> ## Values
>
> This section describes the value and meaning of constants (other than
> @@ -831,6 +912,10 @@ of the newstyle negotiation.
>
> Defined by the experimental `STRUCTURED_REPLY` extension; see below.
>
> +- `NBD_OPT_BLOCK_SIZE` (9)
> +
> + Defined by the experimental `BLOCK_SIZE` extension; see below.
> +
> #### Option reply types
>
> These values are used in the "reply type" field, sent by the server
> @@ -1063,11 +1148,13 @@ The following error values are defined:
> * `ESHUTDOWN` (108), Server is in the process of being shut down.
>
> The server SHOULD return `ENOSPC` if it receives a write request
> -including one or more sectors beyond the size of the device. It SHOULD
> +including one or more sectors beyond the size of the device. It also
> +SHOULD map the `EDQUOT` and `EFBIG` errors to `ENOSPC`. It SHOULD
> return `EINVAL` if it receives a read or trim request including one or
> -more sectors beyond the size of the device. It also SHOULD map the
> -`EDQUOT` and `EFBIG` errors to `ENOSPC`. Finally, it SHOULD return
> -`EPERM` if it receives a write or trim request on a read-only export.
> +more sectors beyond the size of the device, or if a read or write
> +request is not aligned to advertised minimum block sizes. Finally, it
> +SHOULD return `EPERM` if it receives a write or trim request on a
> +read-only export.
>
> The server SHOULD return `EINVAL` if it receives an unknown command.
>
> @@ -1252,10 +1339,57 @@ documentation.
> - 16 bits, `NBD_INFO_DESCRIPTION`
> - String: description of the export, *length - 2* bytes
>
> + * `NBD_INFO_BLOCK_SIZE` (3)
> +
> + Represents the server's advertised block sizes; see the "Block
> + sizes" section for more details on what these values represent,
> + and on constraints on their values. The server MAY send this
> + info whether or not the client has negotiated
> + `NBD_OPT_BLOCK_SIZE`, and SHOULD send this info if it intends to
> + enforce block sizes other than the defaults. The *length* MUST
> + be 14, and the reply payload is interpreted as:
> +
> + - 16 bits, `NBD_INFO_BLOCK_SIZE`
> + - 32 bits, minimum block size
> + - 32 bits, preferred block size
> + - 32 bits, maximum block size
> +
> * `NBD_REP_ERR_UNKNOWN`
>
> The requested export is not available.
>
> +### `BLOCK_SIZE` extension
> +
> +Some servers are able to make optimizations, such as opening files
> +with O_DIRECT, if they know that the client will obey a particular
> +minimum block size, but must fall back to safer, slower code
> +if the client might send unaligned requests. To facilitate optimum
> +coordination between client and server, a `BLOCK_SIZE` extension is
> +envisioned, which adds one new option request.
> +
> +Note that a client MAY obey non-default block sizes even without
> +advertising intent or even when the server does not advertise block
> +sizes; and that a server MAY advertise block sizes even when a client
> +does not advertise intent. Therefore, the use of this option is
> +independent of whether the server uses `NBD_INFO_BLOCK_SIZE`, as
> +documented in the `INFO` extension.
> +
> +* `NBD_OPT_BLOCK_SIZE`
> +
> + The client wishes to inform the server of its intent to obey block
> + sizes. The option request has no additional data.
> +
> + The server MUST reply with `NBD_REP_ACK`, after which point the
> + client SHOULD use `NBD_OPT_GO` rather than `NBD_OPT_EXPORT_NAME`,
> + and the server SHOULD include `NBD_INFO_BLOCK_SIZE` in its reply.
> + If successfully negotiated, and the server advertises block sizes,
> + the client MUST NOT send unaligned requests.
> +
> + For backwards compatibility, clients SHOULD be prepared to also
> + handle `NBD_REP_ERR_UNSUP`, which means the server SHOULD NOT be
> + advertising block sizes, and the client MAY assume the server will
> + honor default block sizes.
> +
> ### `WRITE_ZEROES` extension
>
> There exist some cases when a client knows that the data it is going to write
> @@ -1426,13 +1560,15 @@ error, and alters the reply to the `NBD_CMD_READ` request.
> be at least 12. This reply represents that an error occurred at
> a given offset, which MUST lie within the original offset and
> length of the request; the client can use this offset to
> - determine if request had any partial success. This chunk type
> - MAY appear multiple times in a structured reply, although the
> - same offset SHOULD NOT be repeated. Likewise, if content chunks
> - were sent earlier in the structured reply, the server SHOULD NOT
> - send multiple distinct offsets that lie within the bounds of a
> - single content chunk. Valid as a reply to `NBD_CMD_READ`,
> - `NBD_CMD_WRITE`, `NBD_CMD_WRITE_ZEROES`, and `NBD_CMD_TRIM`.
> + determine if the request had any partial success. The server MAY
> + use an offset that is not a multiple of any advertised minimum
> + block size. This chunk type MAY appear multiple times in a
> + structured reply, although the same offset SHOULD NOT be
> + repeated. Likewise, if content chunks were sent earlier in the
> + structured reply, the server SHOULD NOT send multiple distinct
> + offsets that lie within the bounds of a single content chunk.
> + Valid as a reply to `NBD_CMD_READ`, `NBD_CMD_WRITE`,
> + `NBD_CMD_WRITE_ZEROES`, and `NBD_CMD_TRIM`.
>
> The payload is structured as:
>
> @@ -1517,8 +1653,9 @@ error, and alters the reply to the `NBD_CMD_READ` request.
> The server SHOULD return `EOVERFLOW`, rather than `EINVAL`, when a
> client has requested `NBD_CMD_FLAG_DF` for a length that is too
> large to read without fragmentation. The server MUST NOT return
> - this error if the read request did not exceed 65,536 bytes, and
> - SHOULD NOT return this error if `NBD_CMD_FLAG_DF` is not set.
> + this error if the read request did not exceed the larger of 65,536
> + bytes or the advertised preferred block size, and SHOULD NOT
> + return this error if `NBD_CMD_FLAG_DF` is not set.
>
> * `NBD_CMD_READ`
>
> @@ -1539,14 +1676,19 @@ error, and alters the reply to the `NBD_CMD_READ` request.
> The server MAY split the reply into any number of content chunks;
> each chunk MUST describe at least one byte, although to minimize
> overhead, the server SHOULD use chunks with lengths and offsets as
> - an integer multiple of 512 bytes, where possible (the first and
> - last chunk of an unaligned read being the most obvious places for
> - an exception). The server MUST NOT send content chunks that
> - overlap with any earlier content or error chunk, and MUST NOT send
> - chunks that describe data outside the offset and length of the
> - request, but MAY send the content chunks in any order (the client
> - MUST reassemble content chunks into the correct order), and MAY
> - send additional content chunks even after reporting an error chunk.
> + an integer multiple of the preferred block size (whether the
> + advertised value or the default of 4,096 bytes) where that is
> + possible (the first and last chunk of an unaligned read being the
> + most obvious places for an exception). If the server advertised
> + block sizes, it MUST ensure that every chunk has a length and
> + offset which are an integer multiple of the minimum block size.
> +
> + The server MUST NOT send content chunks that overlap with any
> + earlier content or error chunk, and MUST NOT send chunks that
> + describe data outside the offset and length of the request, but
> + MAY send the content chunks in any order (the client MUST
> + reassemble content chunks into the correct order), and MAY send
> + additional content chunks even after reporting an error chunk.
> Note that a request for more than 2^32 - 8 bytes MUST be split
> into at least two chunks, so as not to overflow the length field
> of a reply while still allowing space for the offset of each
> @@ -1601,7 +1743,8 @@ error, and alters the reply to the `NBD_CMD_READ` request.
> if the length is too large to send without fragmentation, in which
> case it MUST NOT send a content chunk; however, the server MUST
> support unfragmented reads in which the client's request length
> - does not exceed 65,536 bytes.
> + does not exceed the larger of 65,536 bytes or the advertised
> + preferred block size.
>
> ## About this file
>
> --
> 2.5.5
>
>
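To illustrate the `NBD_INFO_BLOCK_SIZE` layout and the constraints in the quoted "Block sizes" section, here is a minimal Python sketch of how a client might decode and sanity-check the 14-byte payload. It is not part of the patch: `parse_block_size_info` and `is_power_of_two` are hypothetical names, the fields are assumed to be in network byte order like the rest of the protocol, and the preferred-size check follows the wording of the quoted text as written.

```python
import struct

NBD_INFO_BLOCK_SIZE = 3      # info type value proposed in the quoted patch
NO_LIMIT = 0xFFFFFFFF        # maximum block size meaning "no inherent limit"


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (hypothetical helper)."""
    return n > 0 and (n & (n - 1)) == 0


def parse_block_size_info(payload: bytes, export_size: int):
    """Decode the payload and return (minimum, preferred, maximum).

    Raises ValueError if the payload breaks a MUST rule from the
    quoted "Block sizes" section.
    """
    if len(payload) != 14:
        raise ValueError("payload length must be 14")
    # 16-bit info type followed by three 32-bit sizes, big-endian (assumed).
    info_type, minimum, preferred, maximum = struct.unpack(">HIII", payload)
    if info_type != NBD_INFO_BLOCK_SIZE:
        raise ValueError("not an NBD_INFO_BLOCK_SIZE reply")
    if not is_power_of_two(minimum) or minimum > 65536:
        raise ValueError("minimum must be a power of 2 no larger than 65536")
    # Wording taken from the quoted patch: at least the smaller of
    # the minimum block size and 4096.
    if not is_power_of_two(preferred) or preferred < min(minimum, 4096):
        raise ValueError("preferred block size violates the quoted rules")
    if maximum != NO_LIMIT and maximum % minimum != 0:
        raise ValueError("maximum must be a multiple of minimum or 0xffffffff")
    if maximum < min(preferred, export_size):
        raise ValueError("maximum is smaller than preferred/export size")
    return minimum, preferred, maximum
```

A caller would pass the reply payload together with the export size it learned from the same `NBD_OPT_GO` exchange; the SHOULD-level recommendations (such as a maximum of at least 2^25) are deliberately not enforced here.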
>
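The client-side request rules in the quoted text (offset and length aligned to the minimum block size, length capped by the maximum block size and by the export size) can be sketched in Python as follows. This is only an illustration under stated assumptions: `check_request` and `split_request` are hypothetical names, the defaults are the ones the quoted section gives for a server that advertises nothing, and zero-length requests are not treated specially.

```python
DEFAULT_MIN = 1          # default minimum block size from the quoted text
NO_LIMIT = 0xFFFFFFFF    # default maximum: effectively unlimited


def check_request(offset, length, export_size,
                  minimum=DEFAULT_MIN, maximum=NO_LIMIT):
    """Return None if the request obeys the block sizes, else a reason."""
    if offset % minimum or length % minimum:
        return "unaligned"
    # The export size acts as the effective maximum when it is smaller.
    if length > min(maximum, export_size):
        return "too large"
    if offset + length > export_size:
        return "beyond export"
    return None


def split_request(offset, length, export_size,
                  minimum=DEFAULT_MIN, maximum=NO_LIMIT):
    """Yield (offset, length) pieces that each pass check_request."""
    # Round the limit down to a multiple of the minimum so that every
    # piece boundary stays aligned (0xffffffff itself is not aligned).
    step = min(maximum, export_size) // minimum * minimum
    if (step == 0 or offset % minimum or length % minimum
            or offset + length > export_size):
        raise ValueError("request cannot be expressed within the block sizes")
    end = offset + length
    while offset < end:
        piece = min(step, end - offset)
        yield offset, piece
        offset += piece
```

For example, with a minimum of 512 and a maximum of 2048, a 5120-byte transfer starting at offset 0 would be issued as three requests of 2048, 2048 and 1024 bytes. A server enforcing the same rules would, per the quoted text, answer the first two failure cases with `EINVAL`.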
--
Alex Bligh