[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: [PATCH v2 2/3] doc: Clarify maximum size limits



On 03/16/2018 11:10 AM, Vladimir Sementsov-Ogievskiy wrote:

-If block size constraints have not been advertised or agreed on externally, -then a client SHOULD assume a default minimum block size of 1, a preferred
-block size of 2^12 (4,096), and a maximum block size of the smaller of
-the export size or 0xffffffff (effectively unlimited).  A server that
-wants to enforce block sizes other than the defaults specified here
-MAY refuse to go into transmission phase with a client that uses
-`NBD_OPT_EXPORT_NAME` (via a hard disconnect) or which fails to use
-`NBD_INFO_BLOCK_SIZE` with `NBD_OPT_GO` (where the server uses
-`NBD_REP_ERR_BLOCK_SIZE_REQD`), although a server SHOULD permit such
-clients if block size constraints are the default or can be agreed on
-externally.  When allowing such clients, the server MUST cleanly error
-commands that fall outside block size constraints without corrupting
-data; even so, this may limit interoperability.
+If block size constraints have not been advertised or agreed on

"advertised" here means answer on OPT_INFO or OPT_GO

Mostly yes. The information reported by NBD_OPT_INFO may differ from that reported by NBD_OPT_GO if the client did other things in the meantime, or changed its set of requested info; but the spec does state that if the client passes the same request to both, and NBD_OPT_INFO succeeded, then NBD_OPT_GO should also report the same results on success.

Qemu server's implementation is interesting: right now, it wants to always advertise a minimum block size of 512, although I want to eventually relax it to advertise a minimum block size of 1. So the code does:


    /* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
     * according to whether the client requested it, and according to
     * whether this is OPT_INFO or OPT_GO. */
    /* minimum - 1 for back-compat, or 512 if client is new enough.
     * TODO: consult blk_bs(blk)->bl.request_alignment? */
    sizes[0] =
(client->opt == NBD_OPT_INFO || blocksize) ? BDRV_SECTOR_SIZE : 1;

but follows up with:

    /* If the client is just asking for NBD_OPT_INFO, but forgot to
     * request block sizes, return an error.
     * TODO: consult blk_bs(blk)->request_align, and only error if it
     * is not 1? */
    if (client->opt == NBD_OPT_INFO && !blocksize) {
        return nbd_negotiate_send_rep_err(client,
                                          NBD_REP_ERR_BLOCK_SIZE_REQD,

so the possible outcomes of an INFO followed by GO are:

INFO with size, GO with size: for now, both report 512 (qemu might be fixed such that both report 1 in cases where alignment doesn't matter); this is compliant, since both replies match

INFO without size, GO with size: info reports 512 but fails with NBD_REP_ERR_BLOCK_SIZE_REQD, GO reports 512 and succeeds; this is compliant whether or not both replies matched, because the INFO response did not succeed

INFO without size, GO without size: info reports 512 but fails with NBD_REP_ERR_BLOCK_SIZE_REQD, GO reports 1 and succeeds; the results differ, but this is compliant because the INFO response did not succeed

INFO with size, GO without size: INFO reports 512, GO reports 1; the results differ, but this is compliant because the client changed the request between the two calls

So I don't think I'll change this text for v3, but I am pointing out that "advertised" strictly means only what NBD_OPT_GO replies, as there are situations where the reply to NBD_OPT_INFO is not final



+externally, then a server SHOULD support a default minimum block size
[1]
+of 1, a preferred block size of 2^12 (4,096), and a maximum block size
+that is effectively unlimited (0xffffffff, or the export size if that
+is smaller), while a client desiring maximum interoperability SHOULD
+constrain its requests to a minimum block size of 2^9 (512), and limit
+`NBD_CMD_READ` and `NBD_CMD_WRITE` commands to a maximum block size of
+2^25 (33,554,432).

for client, defaults: 1 -> 512, unlimited -> 32m. for server - unchanged. And wording is very understandable.


A server that wants to enforce block sizes other
(it's a server, which don't want to follow [1]SHOULD)

+than the defaults specified here MAY refuse to go into transmission
+phase with a client that uses `NBD_OPT_EXPORT_NAME` (via a hard
+disconnect) or which
unchanged


  uses `NBD_OPT_GO` without requesting
+`NBD_INFO_BLOCK_SIZE` (via an error reply of
+`NBD_REP_ERR_BLOCK_SIZE_REQD`); but servers SHOULD NOT refuse clients
+that do not request sizing information when the server supports
actually, all servers SHOULD support these defaults [1]

Note - a server SHOULD support defaults, not MUST. A server that does not support the defaults is still compliant, just less portable. Hence, the additional wording that a server that does NOT support the defaults can still be compliant, but only if it is graceful at either rejecting clients that don't request sizing information, or graceful if it reports an error on unaligned requests.

(And in the case of qemu advertising 512 byte alignment above - qemu DOES correctly handle the defaults via read-modify-write, even when the client sends an unaligned request)


+default sizing or where sizing constraints can be agreed on
+externally.  When allowing clients that did not negotiate sizing via
+NBD, a server that enforces stricter block size constraints than the
+defaults MUST cleanly error commands that fall outside the constraints
+without corrupting data;
At first, I wanted to s/corrupting/changing, as not every changing may be considered as corrupting. But, actually, any changing when we return an error may be considered as wrong and corrupting. So, it should
be OK.

Yes - normally, when you return an error on a WRITE, the client can't tell how much (if any) of the write was performed before the error occurred. The wording here states that if the error is going to be due to diagnosing misalignment, that diagnosis should occur without changing disk contents, so that the client can in this limited case make an assumption that the disk did not change when an alignment error is reported (and any write to the disk to the contrary is indeed viewed by the client as corruption).


(hm, also, actually we allow server to error any command, if constraints are not advertised)

Yes, that means a fully portable client should try NBD_OPT_GO first (to see if constraints will be advertised), and then be prepared for commands to fail if constraints are not advertised after all (either because NBD_OPT_GO didn't include NBD_REP_INFO with block size, or because the server required falling back to NBD_OPT_EXPORT_NAME).



final cases for not advertised constraints:

any server SHOULD support defaults
if server support default, but want not default: it MAY, but SHOULD NOT refuse
if server doesn't support default:
     - it MAY refuse
     - it SHOULD NOT refuse if constraints can be agreed on externally
     - it MUST error commands that fall outside the constraints

for generic client it all just means, that generic server MAY permit, or MAY refuse, and MAY error commands if permitted connection. (external agreement is a separate case, not about generic server and client)

Yes, I think we are restating the same thing, which hopefully means my improved wording is reasonable. I'll resubmit a v3 of this series for Wouter's comments.

  The preferred block size represents the minimum size at which aligned
  requests will have efficient I/O, avoiding behaviour such as
  read-modify-write.  If advertised, this MUST be a power of 2 at least
  as large as the smaller of the minimum block size and 2^12 (4,096),

pref_block >= min(min_block, 4K).. So, pref block may be less than
min_block??? it looks like a bug.
(consider min_block = 64K,  pref_block = 4K: 4K >= min(64K, 4K))

Hmm, you're right.  I think it may be better as:

pref_block >= max(min_block, 4K)

In fact, that's probably worth a separate commit, to make the change obvious rather than buried in the wall of other changes.


  although larger values (such as the minimum granularity of a hole) are
  also appropriate.
don't understand.. "are also" after "MUST"?

If we switch min() to max() in the equation above, then with min_block = 1, a pref_block of 4k is valid, but a pref_block of 64k may be more appropriate if that is the granularity of holes. So the phrase about "are also" is expounding on why pref_block > max(min_block, 4k) makes sense.


   The preferred block size MAY be larger than the
  export size, in which case the client is unable to utilize the
  preferred block size for that export.
Hmm, we we need it? Why not SHOULD NOT?


If you have an efficient block size of 64k, but export a 512-byte volume, it is easier to code the server to always report a pref_block = 64k, even when that exceeds size, than to require the server to report pref_block = max(export size, 64k). (Particularly if we change the wording above to require that pref_block >= max(min_block, 4k), at which point a pref_block of 512 is invalid.)

Overall, pref_block is the fuzziest of the constructs, because it deals more with optimization (what sizes should I use for the least amount of work) and not correctness (what alignment must I use, and what is the maximum request I can send).


   The server MAY advertise an
  export size that is not an integer multiple of the preferred block
  size.


  The maximum block size represents the maximum length that the server
-is willing to handle in one request.  If advertised, it MUST be either
-an integer multiple of the minimum block size or the value 0xffffffff
-for no inherent limit, MUST be at least as large as the smaller of the
-preferred block size or export size, and SHOULD be at least 2^25
-(33,554,432) if the export is that large, but MAY be something other
-than a power of 2.  For convenience, the server MAY advertise a
-maximum block size that is larger than the export size, although in
-that case, the client MUST treat the export size as the effective
-maximum block size (as further constrained by a nonzero offset).
+is willing to handle in one request.  If advertised, it MAY be
+something other than a power of 2, but MUST be either an integer

but, for "something other", only 0xffffffff is possible, so actually the first
sub sentence is useless, I think it may be dropped. (to be multiple of the
minimum block size involves to be a power of 2)

No. There are iSCSI devices in the wild that advertise a maximum block size of 15M, which is not a power of 2, but is an integer multiple of the minimum block size (512 or 4k, I'm not sure which).


+multiple of the minimum block size or the value 0xffffffff for no
+inherent limit, MUST be at least as large as the smaller of the
+preferred block size or export size, and SHOULD be at least 2^20
+(1,048,576) if the export is that large.  For convenience, the server
+MAY advertise a maximum block size that is larger than the export
+size, although in that case, the client MUST treat the export size as
+the effective maximum block size (as further constrained by a nonzero
+offset).

So here, only 32M -> 1M for the minimum and "MAY be something other than a power of 2" moved to the top. It's better, but I think it should be dropped at all.



  Where a transmission request can have a nonzero *offset* and/or
  *length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
@@ -804,9 +808,18 @@ those requests, the client MUST NOT use a *length* larger than any
  advertised maximum block size or which, when added to *offset*, would
  exceed the export size.  The server SHOULD report an `EINVAL` error if
  the client's request is not aligned to advertised minimum block size
-boundaries, or is larger than the advertised maximum block size,
-although the server MAY instead initiate a hard disconnect if a large
-*length* could be deemed as a denial of service attack.
+boundaries, or is larger than the advertised maximum block size.

and what about requests which exceed the export size?

We already cover elsewhere that write/write_zeroes beyond export size should fail with ENOSPC, and other requests beyond export size should fail with EINVAL. In the POSIX tradition, if you make a request that is invalid by more than one possible error code, it is unspecified which of those error codes you get, merely that the code you DO get makes sense.


+Notwithstanding any maximum block size advertised, either the server
+or the client MAY initiate a hard disconnect if the payload of an
+`NBD_CMD_WRITE` request or `NBD_CMD_READ` reply would be large enough
+to be deemed a denial of service attack; however, for maximum
+portability, any length less than 2^25 (33,554,432) bytes SHOULD NOT
+be considered a denial of service attack (even if the advertised
+maximum block size is smaller).  For all other commands, where the
+*length* is not reflected in the payload (such as `NBD_CMD_TRIM` or
+`NBD_CMD_WRITE_ZEROES`), a server SHOULD merely fail the command with
+an `EINVAL` error for a client that exceeds the maximum block size,
+rather than initiating a hard disconnect.

about other commands: structured reply to BLOCK_STATUS may be near to 4G, so we'll want to change this paragraph a bit, when we merge BLOCK_STATUS.

Yes, my BLOCK_STATUS patches are a different series, because it is on a different branch than master, but I need to repost those as well.



  ## Values


Thank you for understandable rewordings!

We'll see how v3 fares.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


Reply to: