Re: [PATCH v2 2/3] doc: Clarify maximum size limits

To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>, nbd@other.debian.org
Cc: edgar.kaziakhmedov@virtuozzo.com, w@uter.be
Subject: Re: [PATCH v2 2/3] doc: Clarify maximum size limits
From: Eric Blake <eblake@redhat.com>
Date: Wed, 28 Mar 2018 11:21:01 -0500
Message-id: <[🔎] f3ace18f-d935-e2d5-2f85-0f3f89f01b93@redhat.com>
In-reply-to: <[🔎] 546b95d6-7e5b-fab6-7dc6-3dd927eae344@virtuozzo.com>
References: <[🔎] 20180314211622.784604-1-eblake@redhat.com> <[🔎] 20180314211622.784604-3-eblake@redhat.com> <[🔎] 546b95d6-7e5b-fab6-7dc6-3dd927eae344@virtuozzo.com>

On 03/16/2018 11:10 AM, Vladimir Sementsov-Ogievskiy wrote:

-If block size constraints have not been advertised or agreed onexternally,-then a client SHOULD assume a default minimum block size of 1, apreferred

-block size of 2^12 (4,096), and a maximum block size of the smaller of
-the export size or 0xffffffff (effectively unlimited).  A server that
-wants to enforce block sizes other than the defaults specified here
-MAY refuse to go into transmission phase with a client that uses
-`NBD_OPT_EXPORT_NAME` (via a hard disconnect) or which fails to use
-`NBD_INFO_BLOCK_SIZE` with `NBD_OPT_GO` (where the server uses
-`NBD_REP_ERR_BLOCK_SIZE_REQD`), although a server SHOULD permit such
-clients if block size constraints are the default or can be agreed on
-externally.  When allowing such clients, the server MUST cleanly error
-commands that fall outside block size constraints without corrupting
-data; even so, this may limit interoperability.
+If block size constraints have not been advertised or agreed on


"advertised" here means answer on OPT_INFO or OPT_GO

Mostly yes. The information reported by NBD_OPT_INFO may differ fromthat reported by NBD_OPT_GO if the client did other things in themeantime, or changed its set of requested info; but the spec does statethat if the client passes the same request to both, and NBD_OPT_INFOsucceeded, then NBD_OPT_GO should also report the same results on success.

Qemu server's implementation is interesting: right now, it wants toalways advertise a minimum block size of 512, although I want toeventually relax it to advertise a minimum block size of 1. So the codedoes:



    /* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
     * according to whether the client requested it, and according to
     * whether this is OPT_INFO or OPT_GO. */
    /* minimum - 1 for back-compat, or 512 if client is new enough.
     * TODO: consult blk_bs(blk)->bl.request_alignment? */
    sizes[0] =

(client->opt == NBD_OPT_INFO || blocksize) ?BDRV_SECTOR_SIZE : 1;


but follows up with:

    /* If the client is just asking for NBD_OPT_INFO, but forgot to
     * request block sizes, return an error.
     * TODO: consult blk_bs(blk)->request_align, and only error if it
     * is not 1? */
    if (client->opt == NBD_OPT_INFO && !blocksize) {
        return nbd_negotiate_send_rep_err(client,
                                          NBD_REP_ERR_BLOCK_SIZE_REQD,

so the possible outcomes of an INFO followed by GO are:

INFO with size, GO with size: for now, both report 512 (qemu might befixed such that both report 1 in cases where alignment doesn't matter);this is compliant, since both replies match

INFO without size, GO with size: info reports 512 but fails withNBD_REP_ERR_BLOCK_SIZE_REQD, GO reports 512 and succeeds; this iscompliant whether or not both replies matched, because the INFO responsedid not succeed

INFO without size, GO without size: info reports 512 but fails withNBD_REP_ERR_BLOCK_SIZE_REQD, GO reports 1 and succeeds; the resultsdiffer, but this is compliant because the INFO response did not succeed

INFO with size, GO without size: INFO reports 512, GO reports 1; theresults differ, but this is compliant because the client changed therequest between the two calls

So I don't think I'll change this text for v3, but I am pointing outthat "advertised" strictly means only what NBD_OPT_GO replies, as thereare situations where the reply to NBD_OPT_INFO is not final

+externally, then a server SHOULD support a default minimum block size

[1]

+of 1, a preferred block size of 2^12 (4,096), and a maximum block size
+that is effectively unlimited (0xffffffff, or the export size if that
+is smaller), while a client desiring maximum interoperability SHOULD
+constrain its requests to a minimum block size of 2^9 (512), and limit
+`NBD_CMD_READ` and `NBD_CMD_WRITE` commands to a maximum block size of
+2^25 (33,554,432).

for client, defaults: 1 -> 512, unlimited -> 32m. for server -unchanged. And wording is very understandable.

A server that wants to enforce block sizes other

(it's a server, which don't want to follow [1]SHOULD)

+than the defaults specified here MAY refuse to go into transmission
+phase with a client that uses `NBD_OPT_EXPORT_NAME` (via a hard
+disconnect) or which

unchanged

  uses `NBD_OPT_GO` without requesting
+`NBD_INFO_BLOCK_SIZE` (via an error reply of
+`NBD_REP_ERR_BLOCK_SIZE_REQD`); but servers SHOULD NOT refuse clients
+that do not request sizing information when the server supports

actually, all servers SHOULD support these defaults [1]

Note - a server SHOULD support defaults, not MUST. A server that doesnot support the defaults is still compliant, just less portable. Hence,the additional wording that a server that does NOT support the defaultscan still be compliant, but only if it is graceful at either rejectingclients that don't request sizing information, or graceful if it reportsan error on unaligned requests.

(And in the case of qemu advertising 512 byte alignment above - qemuDOES correctly handle the defaults via read-modify-write, even when theclient sends an unaligned request)

+default sizing or where sizing constraints can be agreed on
+externally.  When allowing clients that did not negotiate sizing via
+NBD, a server that enforces stricter block size constraints than the
+defaults MUST cleanly error commands that fall outside the constraints
+without corrupting data;
At first, I wanted to s/corrupting/changing, as not every changing maybe considered as corrupting. But,actually, any changing when we return an error may be considered aswrong and corrupting. So, it should
be OK.

Yes - normally, when you return an error on a WRITE, the client can'ttell how much (if any) of the write was performed before the erroroccurred. The wording here states that if the error is going to be dueto diagnosing misalignment, that diagnosis should occur without changingdisk contents, so that the client can in this limited case make anassumption that the disk did not change when an alignment error isreported (and any write to the disk to the contrary is indeed viewed bythe client as corruption).

(hm, also, actually we allow server to error any command, if constraintsare not advertised)

Yes, that means a fully portable client should try NBD_OPT_GO first (tosee if constraints will be advertised), and then be prepared forcommands to fail if constraints are not advertised after all (eitherbecause NBD_OPT_GO didn't include NBD_REP_INFO with block size, orbecause the server required falling back to NBD_OPT_EXPORT_NAME).

final cases for not advertised constraints:

any server SHOULD support defaults
if server support default, but want not default: it MAY, but SHOULD NOTrefuse
if server doesn't support default:
     - it MAY refuse
     - it SHOULD NOT refuse if constraints can be agreed on externally
     - it MUST error commands that fall outside the constraints
for generic client it all just means, that generic server MAY permit, orMAY refuse, and MAY error commands if permitted connection.(external agreement is a separate case, not about generic server andclient)

Yes, I think we are restating the same thing, which hopefully means myimproved wording is reasonable. I'll resubmit a v3 of this series forWouter's comments.

  The preferred block size represents the minimum size at which aligned
  requests will have efficient I/O, avoiding behaviour such as
  read-modify-write.  If advertised, this MUST be a power of 2 at least
  as large as the smaller of the minimum block size and 2^12 (4,096),


pref_block >= min(min_block, 4K).. So, pref block may be less than
min_block??? it looks like a bug.
(consider min_block = 64K,  pref_block = 4K: 4K >= min(64K, 4K))


Hmm, you're right.  I think it may be better as:

pref_block >= max(min_block, 4K)

In fact, that's probably worth a separate commit, to make the changeobvious rather than buried in the wall of other changes.

  although larger values (such as the minimum granularity of a hole) are
  also appropriate.

don't understand.. "are also" after "MUST"?

If we switch min() to max() in the equation above, then with min_block =1, a pref_block of 4k is valid, but a pref_block of 64k may be moreappropriate if that is the granularity of holes. So the phrase about"are also" is expounding on why pref_block > max(min_block, 4k) makes sense.

   The preferred block size MAY be larger than the
  export size, in which case the client is unable to utilize the
  preferred block size for that export.

Hmm, we we need it? Why not SHOULD NOT?

If you have an efficient block size of 64k, but export a 512-bytevolume, it is easier to code the server to always report a pref_block =64k, even when that exceeds size, than to require the server to reportpref_block = max(export size, 64k). (Particularly if we change thewording above to require that pref_block >= max(min_block, 4k), at whichpoint a pref_block of 512 is invalid.)

Overall, pref_block is the fuzziest of the constructs, because it dealsmore with optimization (what sizes should I use for the least amount ofwork) and not correctness (what alignment must I use, and what is themaximum request I can send).

   The server MAY advertise an
  export size that is not an integer multiple of the preferred block
  size.


  The maximum block size represents the maximum length that the server
-is willing to handle in one request.  If advertised, it MUST be either
-an integer multiple of the minimum block size or the value 0xffffffff
-for no inherent limit, MUST be at least as large as the smaller of the
-preferred block size or export size, and SHOULD be at least 2^25
-(33,554,432) if the export is that large, but MAY be something other
-than a power of 2.  For convenience, the server MAY advertise a
-maximum block size that is larger than the export size, although in
-that case, the client MUST treat the export size as the effective
-maximum block size (as further constrained by a nonzero offset).
+is willing to handle in one request.  If advertised, it MAY be
+something other than a power of 2, but MUST be either an integer

but, for "something other", only 0xffffffff is possible, so actually thefirst

sub sentence is useless, I think it may be dropped. (to be multiple of the
minimum block size involves to be a power of 2)

No. There are iSCSI devices in the wild that advertise a maximum blocksize of 15M, which is not a power of 2, but is an integer multiple ofthe minimum block size (512 or 4k, I'm not sure which).

+multiple of the minimum block size or the value 0xffffffff for no
+inherent limit, MUST be at least as large as the smaller of the
+preferred block size or export size, and SHOULD be at least 2^20
+(1,048,576) if the export is that large.  For convenience, the server
+MAY advertise a maximum block size that is larger than the export
+size, although in that case, the client MUST treat the export size as
+the effective maximum block size (as further constrained by a nonzero
+offset).

So here, only 32M -> 1M for the minimum and "MAY be something other thana power of 2" moved to the top. It's better, but I think it should bedropped at all.


  Where a transmission request can have a nonzero *offset* and/or
  *length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),

@@ -804,9 +808,18 @@ those requests, the client MUST NOT use a*length* larger than any

  advertised maximum block size or which, when added to *offset*, would
  exceed the export size.  The server SHOULD report an `EINVAL` error if
  the client's request is not aligned to advertised minimum block size
-boundaries, or is larger than the advertised maximum block size,
-although the server MAY instead initiate a hard disconnect if a large
-*length* could be deemed as a denial of service attack.
+boundaries, or is larger than the advertised maximum block size.


and what about requests which exceed the export size?

We already cover elsewhere that write/write_zeroes beyond export sizeshould fail with ENOSPC, and other requests beyond export size shouldfail with EINVAL. In the POSIX tradition, if you make a request that isinvalid by more than one possible error code, it is unspecified which ofthose error codes you get, merely that the code you DO get makes sense.

+Notwithstanding any maximum block size advertised, either the server
+or the client MAY initiate a hard disconnect if the payload of an
+`NBD_CMD_WRITE` request or `NBD_CMD_READ` reply would be large enough
+to be deemed a denial of service attack; however, for maximum
+portability, any length less than 2^25 (33,554,432) bytes SHOULD NOT
+be considered a denial of service attack (even if the advertised
+maximum block size is smaller).  For all other commands, where the
+*length* is not reflected in the payload (such as `NBD_CMD_TRIM` or
+`NBD_CMD_WRITE_ZEROES`), a server SHOULD merely fail the command with
+an `EINVAL` error for a client that exceeds the maximum block size,
+rather than initiating a hard disconnect.

about other commands: structured reply to BLOCK_STATUS may be near to4G, so we'll want to change this paragraph a bit, when we mergeBLOCK_STATUS.

Yes, my BLOCK_STATUS patches are a different series, because it is on adifferent branch than master, but I need to repost those as well.


  ## Values


Thank you for understandable rewordings!


We'll see how v3 fares.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Reply to:

Follow-Ups:
- Re: [PATCH v2 2/3] doc: Clarify maximum size limits
  - From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

References:
- [PATCH v2 0/3] More clarification on maximum limits
  - From: Eric Blake <eblake@redhat.com>
- [PATCH v2 2/3] doc: Clarify maximum size limits
  - From: Eric Blake <eblake@redhat.com>
- Re: [PATCH v2 2/3] doc: Clarify maximum size limits
  - From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Prev by Date: Re: [Qemu-devel] [PATCH 3/4] nbd/server: implement dirty bitmap export
Next by Date: Re: [PATCH v2 3/3] doc: Add alternate trim/zero limits
Previous by thread: Re: [PATCH v2 2/3] doc: Clarify maximum size limits
Next by thread: Re: [PATCH v2 2/3] doc: Clarify maximum size limits
Index(es):
- Date
- Thread