Re: [PATCH v2 2/3] doc: Clarify maximum size limits
On 03/16/2018 11:10 AM, Vladimir Sementsov-Ogievskiy wrote:
-If block size constraints have not been advertised or agreed on
externally,
-then a client SHOULD assume a default minimum block size of 1, a
preferred
-block size of 2^12 (4,096), and a maximum block size of the smaller of
-the export size or 0xffffffff (effectively unlimited). A server that
-wants to enforce block sizes other than the defaults specified here
-MAY refuse to go into transmission phase with a client that uses
-`NBD_OPT_EXPORT_NAME` (via a hard disconnect) or which fails to use
-`NBD_INFO_BLOCK_SIZE` with `NBD_OPT_GO` (where the server uses
-`NBD_REP_ERR_BLOCK_SIZE_REQD`), although a server SHOULD permit such
-clients if block size constraints are the default or can be agreed on
-externally. When allowing such clients, the server MUST cleanly error
-commands that fall outside block size constraints without corrupting
-data; even so, this may limit interoperability.
+If block size constraints have not been advertised or agreed on
"advertised" here means answer on OPT_INFO or OPT_GO
Mostly yes. The information reported by NBD_OPT_INFO may differ from
that reported by NBD_OPT_GO if the client did other things in the
meantime, or changed its set of requested info; but the spec does state
that if the client passes the same request to both, and NBD_OPT_INFO
succeeded, then NBD_OPT_GO should also report the same results on success.
Qemu server's implementation is interesting: right now, it wants to
always advertise a minimum block size of 512, although I want to
eventually relax it to advertise a minimum block size of 1. So the code
does:
/* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
* according to whether the client requested it, and according to
* whether this is OPT_INFO or OPT_GO. */
/* minimum - 1 for back-compat, or 512 if client is new enough.
* TODO: consult blk_bs(blk)->bl.request_alignment? */
sizes[0] =
(client->opt == NBD_OPT_INFO || blocksize) ?
BDRV_SECTOR_SIZE : 1;
but follows up with:
/* If the client is just asking for NBD_OPT_INFO, but forgot to
* request block sizes, return an error.
* TODO: consult blk_bs(blk)->request_align, and only error if it
* is not 1? */
if (client->opt == NBD_OPT_INFO && !blocksize) {
return nbd_negotiate_send_rep_err(client,
NBD_REP_ERR_BLOCK_SIZE_REQD,
so the possible outcomes of an INFO followed by GO are:
INFO with size, GO with size: for now, both report 512 (qemu might be
fixed such that both report 1 in cases where alignment doesn't matter);
this is compliant, since both replies match
INFO without size, GO with size: info reports 512 but fails with
NBD_REP_ERR_BLOCK_SIZE_REQD, GO reports 512 and succeeds; this is
compliant whether or not both replies matched, because the INFO response
did not succeed
INFO without size, GO without size: info reports 512 but fails with
NBD_REP_ERR_BLOCK_SIZE_REQD, GO reports 1 and succeeds; the results
differ, but this is compliant because the INFO response did not succeed
INFO with size, GO without size: INFO reports 512, GO reports 1; the
results differ, but this is compliant because the client changed the
request between the two calls
So I don't think I'll change this text for v3, but I am pointing out
that "advertised" strictly means only what NBD_OPT_GO replies, as there
are situations where the reply to NBD_OPT_INFO is not final
+externally, then a server SHOULD support a default minimum block size
[1]
+of 1, a preferred block size of 2^12 (4,096), and a maximum block size
+that is effectively unlimited (0xffffffff, or the export size if that
+is smaller), while a client desiring maximum interoperability SHOULD
+constrain its requests to a minimum block size of 2^9 (512), and limit
+`NBD_CMD_READ` and `NBD_CMD_WRITE` commands to a maximum block size of
+2^25 (33,554,432).
for client, defaults: 1 -> 512, unlimited -> 32m. for server -
unchanged. And wording is very understandable.
A server that wants to enforce block sizes other
(it's a server, which don't want to follow [1]SHOULD)
+than the defaults specified here MAY refuse to go into transmission
+phase with a client that uses `NBD_OPT_EXPORT_NAME` (via a hard
+disconnect) or which
unchanged
uses `NBD_OPT_GO` without requesting
+`NBD_INFO_BLOCK_SIZE` (via an error reply of
+`NBD_REP_ERR_BLOCK_SIZE_REQD`); but servers SHOULD NOT refuse clients
+that do not request sizing information when the server supports
actually, all servers SHOULD support these defaults [1]
Note - a server SHOULD support defaults, not MUST. A server that does
not support the defaults is still compliant, just less portable. Hence,
the additional wording that a server that does NOT support the defaults
can still be compliant, but only if it is graceful at either rejecting
clients that don't request sizing information, or graceful if it reports
an error on unaligned requests.
(And in the case of qemu advertising 512 byte alignment above - qemu
DOES correctly handle the defaults via read-modify-write, even when the
client sends an unaligned request)
+default sizing or where sizing constraints can be agreed on
+externally. When allowing clients that did not negotiate sizing via
+NBD, a server that enforces stricter block size constraints than the
+defaults MUST cleanly error commands that fall outside the constraints
+without corrupting data;
At first, I wanted to s/corrupting/changing, as not every changing may
be considered as corrupting. But,
actually, any changing when we return an error may be considered as
wrong and corrupting. So, it should
be OK.
Yes - normally, when you return an error on a WRITE, the client can't
tell how much (if any) of the write was performed before the error
occurred. The wording here states that if the error is going to be due
to diagnosing misalignment, that diagnosis should occur without changing
disk contents, so that the client can in this limited case make an
assumption that the disk did not change when an alignment error is
reported (and any write to the disk to the contrary is indeed viewed by
the client as corruption).
(hm, also, actually we allow server to error any command, if constraints
are not advertised)
Yes, that means a fully portable client should try NBD_OPT_GO first (to
see if constraints will be advertised), and then be prepared for
commands to fail if constraints are not advertised after all (either
because NBD_OPT_GO didn't include NBD_REP_INFO with block size, or
because the server required falling back to NBD_OPT_EXPORT_NAME).
final cases for not advertised constraints:
any server SHOULD support defaults
if server support default, but want not default: it MAY, but SHOULD NOT
refuse
if server doesn't support default:
- it MAY refuse
- it SHOULD NOT refuse if constraints can be agreed on externally
- it MUST error commands that fall outside the constraints
for generic client it all just means, that generic server MAY permit, or
MAY refuse, and MAY error commands if permitted connection.
(external agreement is a separate case, not about generic server and
client)
Yes, I think we are restating the same thing, which hopefully means my
improved wording is reasonable. I'll resubmit a v3 of this series for
Wouter's comments.
The preferred block size represents the minimum size at which aligned
requests will have efficient I/O, avoiding behaviour such as
read-modify-write. If advertised, this MUST be a power of 2 at least
as large as the smaller of the minimum block size and 2^12 (4,096),
pref_block >= min(min_block, 4K).. So, pref block may be less than
min_block??? it looks like a bug.
(consider min_block = 64K, pref_block = 4K: 4K >= min(64K, 4K))
Hmm, you're right. I think it may be better as:
pref_block >= max(min_block, 4K)
In fact, that's probably worth a separate commit, to make the change
obvious rather than buried in the wall of other changes.
although larger values (such as the minimum granularity of a hole) are
also appropriate.
don't understand.. "are also" after "MUST"?
If we switch min() to max() in the equation above, then with min_block =
1, a pref_block of 4k is valid, but a pref_block of 64k may be more
appropriate if that is the granularity of holes. So the phrase about
"are also" is expounding on why pref_block > max(min_block, 4k) makes sense.
The preferred block size MAY be larger than the
export size, in which case the client is unable to utilize the
preferred block size for that export.
Hmm, we we need it? Why not SHOULD NOT?
If you have an efficient block size of 64k, but export a 512-byte
volume, it is easier to code the server to always report a pref_block =
64k, even when that exceeds size, than to require the server to report
pref_block = max(export size, 64k). (Particularly if we change the
wording above to require that pref_block >= max(min_block, 4k), at which
point a pref_block of 512 is invalid.)
Overall, pref_block is the fuzziest of the constructs, because it deals
more with optimization (what sizes should I use for the least amount of
work) and not correctness (what alignment must I use, and what is the
maximum request I can send).
The server MAY advertise an
export size that is not an integer multiple of the preferred block
size.
The maximum block size represents the maximum length that the server
-is willing to handle in one request. If advertised, it MUST be either
-an integer multiple of the minimum block size or the value 0xffffffff
-for no inherent limit, MUST be at least as large as the smaller of the
-preferred block size or export size, and SHOULD be at least 2^25
-(33,554,432) if the export is that large, but MAY be something other
-than a power of 2. For convenience, the server MAY advertise a
-maximum block size that is larger than the export size, although in
-that case, the client MUST treat the export size as the effective
-maximum block size (as further constrained by a nonzero offset).
+is willing to handle in one request. If advertised, it MAY be
+something other than a power of 2, but MUST be either an integer
but, for "something other", only 0xffffffff is possible, so actually the
first
sub sentence is useless, I think it may be dropped. (to be multiple of the
minimum block size involves to be a power of 2)
No. There are iSCSI devices in the wild that advertise a maximum block
size of 15M, which is not a power of 2, but is an integer multiple of
the minimum block size (512 or 4k, I'm not sure which).
+multiple of the minimum block size or the value 0xffffffff for no
+inherent limit, MUST be at least as large as the smaller of the
+preferred block size or export size, and SHOULD be at least 2^20
+(1,048,576) if the export is that large. For convenience, the server
+MAY advertise a maximum block size that is larger than the export
+size, although in that case, the client MUST treat the export size as
+the effective maximum block size (as further constrained by a nonzero
+offset).
So here, only 32M -> 1M for the minimum and "MAY be something other than
a power of 2" moved to the top. It's better, but I think it should be
dropped at all.
Where a transmission request can have a nonzero *offset* and/or
*length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
@@ -804,9 +808,18 @@ those requests, the client MUST NOT use a
*length* larger than any
advertised maximum block size or which, when added to *offset*, would
exceed the export size. The server SHOULD report an `EINVAL` error if
the client's request is not aligned to advertised minimum block size
-boundaries, or is larger than the advertised maximum block size,
-although the server MAY instead initiate a hard disconnect if a large
-*length* could be deemed as a denial of service attack.
+boundaries, or is larger than the advertised maximum block size.
and what about requests which exceed the export size?
We already cover elsewhere that write/write_zeroes beyond export size
should fail with ENOSPC, and other requests beyond export size should
fail with EINVAL. In the POSIX tradition, if you make a request that is
invalid by more than one possible error code, it is unspecified which of
those error codes you get, merely that the code you DO get makes sense.
+Notwithstanding any maximum block size advertised, either the server
+or the client MAY initiate a hard disconnect if the payload of an
+`NBD_CMD_WRITE` request or `NBD_CMD_READ` reply would be large enough
+to be deemed a denial of service attack; however, for maximum
+portability, any length less than 2^25 (33,554,432) bytes SHOULD NOT
+be considered a denial of service attack (even if the advertised
+maximum block size is smaller). For all other commands, where the
+*length* is not reflected in the payload (such as `NBD_CMD_TRIM` or
+`NBD_CMD_WRITE_ZEROES`), a server SHOULD merely fail the command with
+an `EINVAL` error for a client that exceeds the maximum block size,
+rather than initiating a hard disconnect.
about other commands: structured reply to BLOCK_STATUS may be near to
4G, so we'll want to change this paragraph a bit, when we merge
BLOCK_STATUS.
Yes, my BLOCK_STATUS patches are a different series, because it is on a
different branch than master, but I need to repost those as well.
## Values
Thank you for understandable rewordings!
We'll see how v3 fares.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Reply to: