[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

[PATCH v4] doc: Add alternate trim/zero limits



A recent patch (6cbc0d5a) mentioned that a server that honors
larger TRIM/WRITE_ZEROES requests than what it accepts for WRITE
has to choose whether to advertise the maximum block size as the
smaller limit at which it does hard disconnect for WRITE, or the
larger limit at which it returns EINVAL for too-large trim/zero.
Let's make the situation less ambiguous by allowing a client and
server to negotiate explicit (larger) alternate limits for these
two commands.  We are using the fact that NBD_OPT_GO already
documents that the client may request additional NBD_INFO items,
the server may reply with unrequested additional NBD_INFO items,
and that both sides must ignore NBD_INFO items that they don't
recognize.

Note that since there are existing clients that do not expect
the new info items, the server MUST NOT fail a request that is
unaligned to the (larger) minimum trim/zero size with EINVAL;
only something unaligned to the main minimum block size can
result in EINVAL.  Rather, the new minimum sizes are for the
client's information about what is efficient.

Signed-off-by: Eric Blake <eblake@redhat.com>

---

As a proof-of-concept implementation, I've prepared a patch set for
qemu to quickly implement this; I tested that all client-server
combinations (old-old, old-new, new-old, new-new) were able to
negotiate, and that only in the new-new combination, qemu was
able to take advantage of larger write zero requests.

https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg00135.html

v3 was here:
https://lists.debian.org/nbd/2018/03/msg00048.html
In v4: incorporate a few more wording tweaks based on Vladimir's review
---
 doc/proto.md | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 6 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 6cc0fe2..32a36ba 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -799,6 +799,12 @@ size, although in that case, the client MUST treat the export size as
 the effective maximum block size (as further constrained by a nonzero
 offset).

+The server MAY support individual commands that support a larger
+length than the stated maximum block length (such as `NBD_CMD_TRIM`
+and `NBD_CMD_WRITE_ZEROES`); in such cases, the server SHOULD
+advertise the alternative limits as part of the information exchanged
+during `NBD_OPT_INFO` and `NBD_OPT_GO`.
+
 Where a transmission request can have a nonzero *offset* and/or
 *length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
 the client MUST ensure that *offset* and *length* are integer
@@ -818,8 +824,8 @@ be considered a denial of service attack (even if the advertised
 maximum block size is smaller).  For all other commands, where the
 *length* is not reflected in the payload (such as `NBD_CMD_TRIM` or
 `NBD_CMD_WRITE_ZEROES`), a server SHOULD merely fail the command with
-an `EINVAL` error for a client that exceeds the maximum block size,
-rather than initiating a hard disconnect.
+an `EINVAL` error for a client that exceeds the maximum size for that
+command, rather than initiating a hard disconnect.

 ## Metadata querying

@@ -1518,6 +1524,53 @@ during option haggling in the fixed newstyle negotiation.
       - 32 bits, preferred block size  
       - 32 bits, maximum block size  

+    * `NBD_INFO_TRIM_SIZE` (4)
+
+      Represents alternative limits that the server will honour during
+      `NBD_CMD_TRIM`.  The server MUST NOT send this info unless it
+      also sends `NBD_INFO_BLOCK_SIZE`, and SHOULD NOT send this info
+      unless it will also be advertising the transmission flag
+      `NBD_CMD_SEND_TRIM`.  The minimum trim size SHOULD be a power of
+      2, and MUST be at least as large as the preferred block size
+      advertised in `NBD_INFO_BLOCK_SIZE`; it represents the alignment
+      and minimum granularity that can be discarded.  When a client
+      trim request is aligned to the minimum block size but not the
+      minimum trim size, the server MUST NOT fail that request with
+      `EINVAL`, but MAY round the request to the aligned subsest, if
+      any.  The maximum trim size MUST be either 0xffffffff (for no
+      inherent 32-bit limit) or a multiple of the minimum trim size,
+      and MUST be at least as large as the maximum block size
+      advertised in `NBD_INFO_BLOCK_SIZE`.  The *length* MUST be 10,
+      and the reply payload is interpreted as:
+
+      - 16 bits, `NBD_INFO_TRIM_SIZE`  
+      - 32 bits, minimum trim size  
+      - 32 bits, maximum trim size  
+
+    * `NBD_INFO_ZERO_SIZE` (5)
+
+      Represents alternative limits that the server will honour during
+      `NBD_CMD_WRITE_ZEROES`.  The server MUST NOT send this info
+      unless it also sends `NBD_INFO_BLOCK_SIZE`, and SHOULD NOT send
+      this info unless it will also be advertising the transmission
+      flag `NBD_CMD_SEND_WRITE_ZEROES`.  The minimum zero size SHOULD
+      be a power of 2, and MUST be at least as large as the preferred
+      block size advertised in `NBD_INFO_BLOCK_SIZE`; it represents
+      the alignment and minimum granularity that can be efficiently
+      written as zeroes.  When a client zero request is aligned to the
+      minimum block size but not the minimum zero size, the server
+      MUST NOT fail that request with `EINVAL`, but MUST write zeroes
+      (even if by less efficient means).  The maximum zero size MUST
+      be either 0xffffffff (for no inherent 32-bit limit) or a
+      multiple of the minimum zero size, and MUST be at least as large
+      as the maximum block size advertised in `NBD_INFO_BLOCK_SIZE`.
+      The *length* MUST be 10, and the reply payload is interpreted
+      as:
+
+      - 16 bits, `NBD_INFO_ZERO_SIZE`  
+      - 32 bits, minimum zero size  
+      - 32 bits, maximum zero size  
+
 * `NBD_REP_META_CONTEXT` (4)

     A description of a metadata context. Data:
@@ -1794,6 +1847,9 @@ The following request types exist:
     negotiated using `NBD_OPT_STRUCTURED_REPLY`. The client SHOULD NOT
     request a read length of 0; the behavior of a server on such a
     request is unspecified although the server SHOULD NOT disconnect.
+    The optional `NBD_INFO_BLOCK_SIZE` information sent during
+    `NBD_OPT_GO` can provide more details about any read alignment
+    constraints.

     *Simple replies*

@@ -1893,7 +1949,9 @@ The following request types exist:
     *length* number of bytes to be written to the device. The client
     SHOULD NOT request a write length of 0; the behavior of a server on
     such a request is unspecified although the server SHOULD NOT
-    disconnect.
+    disconnect.  The optional `NBD_INFO_BLOCK_SIZE` information sent
+    during `NBD_OPT_GO` can provide more details about any write
+    alignment constraints.

     The server MUST write the data to disk, and then send the reply
     message. The server MAY send the reply message before the data has
@@ -1936,7 +1994,9 @@ The following request types exist:
     a portion of the client's request is actually discarded. The
     client SHOULD NOT request a trim length of 0; the behavior of a
     server on such a request is unspecified although the server SHOULD
-    NOT disconnect.
+    NOT disconnect.  The optional `NBD_INFO_BLOCK_SIZE` and
+    `NBD_INFO_TRIM_SIZE` information sent during `NBD_OPT_GO` can
+    provide more details about any trim alignment constraints.

     After issuing this command, a client MUST NOT make any assumptions
     about the contents of the export affected by this command, until
@@ -1965,7 +2025,10 @@ The following request types exist:
     command flag `NBD_CMD_FLAG_NO_HOLE` to inform the server that the
     area MUST be fully provisioned, ensuring that future writes to the
     same area will not cause fragmentation or cause failure due to
-    insufficient space.
+    insufficient space.  The optional `NBD_INFO_BLOCK_SIZE` and
+    `NBD_INFO_ZERO_SIZE` information sent during `NBD_OPT_GO` can
+    provide more details for using alignments that can optimize the
+    speed at which a zero request can be handled.

     If an error occurs, the server MUST set the appropriate error code
     in the error field.
@@ -1979,7 +2042,9 @@ The following request types exist:
     A block status query request. Length and offset define the range
     of interest.  The client SHOULD NOT request a status length of 0;
     the behavior of a server on such a request is unspecified although
-    the server SHOULD NOT disconnect.
+    the server SHOULD NOT disconnect.  The optional
+    `NBD_INFO_BLOCK_SIZE` information sent during `NBD_OPT_GO` can
+    provide more details about any status alignment constraints.

     A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless within the
     negotiation phase it sent `NBD_OPT_SET_META_CONTEXT` at least
-- 
2.14.3


Reply to: