Re: [Nbd] [PATCH v3 2/2] doc: Add details on block sizes

To: Alex Bligh <alex@...872...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>
Subject: Re: [Nbd] [PATCH v3 2/2] doc: Add details on block sizes
From: Eric Blake <eblake@...696...>
Date: Thu, 14 Apr 2016 10:41:47 -0600
Message-id: <570FC84B.8070405@...696...>
In-reply-to: <598E25F1-F1AC-4523-AA91-85E5B06F2693@...872...>
References: <1460594250-12643-1-git-send-email-eblake@...696...> <1460594250-12643-3-git-send-email-eblake@...696...> <598E25F1-F1AC-4523-AA91-85E5B06F2693@...872...>

On 04/14/2016 05:05 AM, Alex Bligh wrote:
> 
> On 14 Apr 2016, at 01:37, Eric Blake <eblake@...696...> wrote:
>>
>> +## Block sizes
>> +
>> +During transmission phase, several operations are constrained by the
>> +export size sent in reply to `NBD_OPT_EXPORT_NAME` or `NBD_OPT_GO`
> 
> or NBD_OPT_INFO?

Not really - while NBD_OPT_INFO followed by NBD_OPT_GO _should_ send the
same export size, it is feasible that an intermediate NBD_OPT_ could
cause the server to advertise a different size; and the transmission
phase is bounded solely by the final advertised export size (ie. only
the two commands that can kick off transmission phase), not by earlier
advertisements.

> 
>> , as
>> +well as by three block sizes defined here (minimum, preferred, and
>> +maximum).  During the handshake phase, a client MAY announce its
>> +intention to honor server block sizes via the experimental
>> +`BLOCK_SIZE` extension; see below.  A client that is worried about
>> +block size constraints SHOULD use `NBD_OPT_GO` rather than
>> +`NBD_OPT_EXPORT_NAME`.  A server SHOULD advertise the block size
>> +contraints during handshake phase via the experimental `INFO`
>> +extension; see below.
> 
> I agree, but this is a bit circular, given this is an experimental
> extension and thus there is no RFC2119 obligation to implement it.
> I think we should leave this in though as a reminded for us to
> deal with it when incorporated into the main standard.

We're getting closer. I've already posted one version of qemu patches
for support for NBD_OPT_INFO, and I've been keeping them up-to-date with
my tweaking here in this patch churn; I'll post an updated working
implementation once we think we have the wording correct.

> 
>>  A server and client MAY agree on block sizes
>> +via out of band means.  Since a server cannot advertise block sizes
>> +for clients
> 
> s/for/to/
> 
>> that use `NBD_OPT_EXPORT_NAME`, a server that wants to
>> +enforce block sizes other than the defaults specified here MUST
>> +support the experimental `INFO` extension, and MAY refuse to connect
>> +to a client
> 
> "MAY refuse to allow a connection from a client" (clients connect to
> servers, not vice versa), but see below.
> 
>> that does not use `NBD_OPT_GO`, but SHOULD NOT refuse to
>> +connect to a client that does not use `NBD_OPT_BLOCK_SIZE`.
> 
> Really? If I've written my O_DIRECT only server, and a client
> connects to it and says it's not going to respect my minimum
> block size, I should permit a connection and fail operations
> that I can't perform?
> 
> I think actually both of these are wrong. In both instances the
> server should permit the *connection* (the TCP connection) but
> they should error going into transmission mode if they don't
> like the fact the client doesn't support block sizes.
> 
> I don't see a necessity to error if it connects via NBD_OPT_EXPORT_NAME
> and the client has a minimum block size (this seems like a recipe
> to prevent qemu-nbd and qemu connecting to each other), and you've
> already said clients/servers MAY exchange this information out of band,
> so I think disconnecting solely because the client has used
> NBD_OPT_EXPORT_NAME is probably something we should not encourage
> with a specific 'MAY'.
> 
> And in any case "MAY refuse to allow a connection from a client" again.

Okay, I'll drop the text about 'refusing a client that uses
NBD_OPT_EXPORT_NAME' (the idea of out-of-band communication is an
important factor), and fold in your word-smithing.

> 
>> +
>> +If block sizes have not been advertised or agreed on externally, then
>> +a client SHOULD assume a default minimum block size of 1, a preferred
>> +block size of 4,096, and no inherent maximum block size; but MAY
>> +choose to operate as if other block sizes had been specified (for
>> +example, by using a minimum block size of 512, a preferred block size
>> +of 65,536, and a maximum block size of 3,355,442 (32M)).
> 
> I'd prefer that rather than 'but MAY' (which somewhat undoes the 'SHOULD')
> we have "if the server chooses to operate ... [above text] ... then
> a consequence will be that a client with which it has not negotiated
> these different block sizes out of band may send requests (e.g. requests
> smaller than 512 bytes or larger than 32M in this case) which the
> server cannot service." - i.e. we should point out the consequences
> of breaching the 'SHOULD' rather than encouraging it.
> 
>>  A server
>> +that does not advertise block sizes (including via external agreement)
>> +SHOULD NOT reject operations due to alignment constraints (that is, it
>> +SHOULD treat the minimum block size as 1),
> 
> Hang on, we've already said it SHOULD assume a default minimum block
> size of 1 ...
> 
>> and SHOULD have no inherent
>> +maximum block size,
> 
> ... and that
> 
> so I'd forget those two, but
> 
>> although it MAY still terminate the connection on
>> +any request where the length is large enough to be deemed a denial of
>> +service attack.
> 
> this phrase should remain, but should also apply to servers that
> did explicitly specify an explicit maximum block size of infinity (-1).
> 
> Also "terminate the connection" -> "initiate a hard disconnect" (no
> harm in using the new language even if the whole of the patch
> has not been taken.
> 
> Also the DoS provision should apply to the client (a server could
> send a pile of unwanted stuff).
> 
> How about deleting this whole sentence and saying.
> 
> "Nothwithstanding any maximum block size advertised, either
> the server or the client MAY initiate a hard disconnect if
> the size of a request or a reply is large enough to be deemed a denial
> of service attack."

Nice - and seems to fit what qemu does (both client and server will
disconnect any time the peer tries to send more than 32M in one go, on
the grounds that either the peer is malicious, or we've gotten
out-of-sync in reading traffic from the peer and there's no sane way to
recover)

> 
>> A server that does advertise block sizes MUST be
>> +prepared for clients that ignore the advertisement, by, at the very
>> +least, sending appropriate errors, and MUST NOT corrupt data in that
>> +case.
> 
> If we permit servers not avoid entering transmission mode with
> clients that don't respect block sizes, and the server also
> advertises block sizes, then if the client uses offsets / lengths
> that don't respect the block size rules, this is violation of the
> protocol and servers are already permitted to disconnect. The
> problem is thus with clients that don't claim they will respect
> block sizes.
> 
> I think rather we should say:
> 
> A server that advertises block sizes and [enters transmission
> mode with a client that does not advertise support for block
> sizes] MUST cleanly error commands that fall outside those blocksize
> parameters and not corrupt data.
> 
> bit in [] optional I suppose.

I'm wondering if there's some way to also cover a server that is willing
to advertise block sizes, but allows a client to enter transmission
phase with NBD_OPT_EXPORT_NAME, also falls in this category.

> 
> 
>> +The minimum block size represents the smallest addressable length and
>> +alignment within the export, although writing to an area that small
>> +may require the server to use a less-efficient read-modify-write
>> +action.  If advertised, this value MUST be a power of 2, MUST NOT be
>> +larger than 65,536, and MAY be as small as 1 for an export backed by a
>> +regular file, although the values of 512 or 4,096 are more typical for
>> +an export backed by a block device.  If a server advertises a minimum
>> +block size, the advertised export size MUST be an integer multiple of
>> +that block size.
>> +
>> +The preferred block size represents the minimum size at which aligned
>> +requests will have efficient I/O, avoiding behaviour such as
>> +read-modify-write.  If advertised, this MUST be a power of 2 at least
>> +as large as the smaller of the minimum block size and 4,096, although
>> +larger values (such as the minimum granularity of a hole) are also
>> +appropriate.  The preferred block size MAY be larger than the export
>> +size, in which case the client is unable to utilize the preferred
>> +block size for that export.
>> +
>> +The maximum block size represents the maximum length that the server
>> +is willing to handle in one request.  If advertised, it MUST be either
>> +an integer multiple of the minimum block size or the value
>> +4,294,967,295 (that is, (uint32_t)-1) for no inherent limit, MUST be
>> +at least as large as the preferred block size, and SHOULD be at least
>> +3,355,442 (32M), but MAY be something other than a power of 2.  For
>> +convenience, the server MAY advertise a maximum block size that is
>> +larger than the export size, although in that case, the client MUST
>> +treat the export size as the effective maximum block size (as further
>> +constrained by a non-zero offset).
>> +
>> +Where a transmission request can have a non-zero *offset* and/or
>> +*length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
>> +the client MUST ensure that *offset* and *length* are integer
>> +multiples of any advertised minimum block size, and SHOULD use integer
>> +multiples of any advertised preferred block size where possible.  For
>> +those requests, the client SHOULD NOT use a *length* of zero,
> 
> I thought you wanted to ditch the zero length thing? Not sure what
> it adds here given your example re trim of length 0. 0 is an integer
> multiple of anything.

In v2, it was MUST; here, I'm relaxing it to SHOULD (it is no longer a
protocol violation, but not useful, and may cause problems if we later
special-case the meaning of 0 length).  Maybe add a phrase "unless
specific semantics are documented for a *length* of 0, the client SHOULD
NOT use a *length* of zero"?

> 
>> +* `NBD_OPT_BLOCK_SIZE`
>> +
>> +    The client wishes to inform the server of its intent to obey block
>> +    sizes.  The option request has no additional data.
>> +
>> +    The server MUST reply with `NBD_REP_ACK`, after which point the
>> +    client SHOULD use `NBD_OPT_GO` rather than `NBD_OPT_EXPORT_NAME`,
>> +    and the server SHOULD include `NBD_INFO_BLOCK_SIZE` in its reply.
>> +    If successfully negotiated, and the server advertises block sizes,
>> +    the client MUST NOT send unaligned requests.
>> +
>> +    For backwards compatibility, clients SHOULD be prepared to also
>> +    handle `NBD_REP_ERR_UNSUP`, which means the server SHOULD NOT be
>> +    advertising block sizes, and the client MAY assume defaults.
> 
> Could we not just use one of the client flags? We have 32 bits of
> them (as opposed to 16 server flags). It's right at the start
> of the negotiation, and we simply need to refer to whether
> NBD_FLAG_C_BLOCKSIZE is set. It can be in the top 16 bits if we
> don't want to use a flag which can match a server flag.

No: we already document "the server MUST drop the connection if the
client sets an unknown flag, or a flag that does not match something
advertised by the server."  The only way for the client to use a client
flag rather than an NBD_OPT_ is for the server to first burn a global
flag, because otherwise, a new client setting the flag but talking to an
older server would never get past flag negotiation and have no good
error message to show for it.  Whereas with NBD_OPT_, the client is at
least guaranteed a sane NBD_REP_ERR_UNSUP; and we have a lot more room
for adding new NBD_OPT_ (2^16) compared to adding new flags (2^5 global,
2^6 client).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature

Reply to:

Follow-Ups:
- Re: [Nbd] [PATCH v3 2/2] doc: Add details on block sizes
  - From: Alex Bligh <alex@...872...>

References:
- [Nbd] [PATCH v3 0/2] block size extension
  - From: Eric Blake <eblake@...696...>
- [Nbd] [PATCH v3 2/2] doc: Add details on block sizes
  - From: Eric Blake <eblake@...696...>
- Re: [Nbd] [PATCH v3 2/2] doc: Add details on block sizes
  - From: Alex Bligh <alex@...872...>

Prev by Date: Re: [Nbd] More efficient treatment of experimental protocol extensions
Next by Date: Re: [Nbd] More efficient treatment of experimental protocol extensions
Previous by thread: Re: [Nbd] [PATCH v3 2/2] doc: Add details on block sizes
Next by thread: Re: [Nbd] [PATCH v3 2/2] doc: Add details on block sizes
Index(es):
- Date
- Thread