Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size

To: Alex Bligh <alex@...872...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>
Subject: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
From: Eric Blake <eblake@...696...>
Date: Wed, 13 Apr 2016 09:24:48 -0600
Message-id: <570E64C0.7080909@...696...>
In-reply-to: <0BD53B45-59ED-47D5-A58D-22E6699FCC24@...872...>
References: <1460506614-31604-1-git-send-email-eblake@...696...> <0BD53B45-59ED-47D5-A58D-22E6699FCC24@...872...>

On 04/13/2016 05:04 AM, Alex Bligh wrote:
> Eric,
> 
> On 13 Apr 2016, at 01:16, Eric Blake <eblake@...696...> wrote:
> 
>> Existing NBD servers often have limitations, such as requiring
>> actions to be aligned to block sizes or limiting maximum
>> transactions to avoid denial of service attacks; for example,
>> qemu's NBD server refuses any transaction larger than 32M.  But
>> to date, clients have to learn these limitations via out-of-band
>> means.
>>
>> This alters NBD_OPT_INFO and NBD_OPT_GO to use a new reply type
>> NBD_REP_INFO (rather than overloading NBD_REP_SERVER), so that
>> we have a future-proof way of supplying as much additional
>> structured information about an export as we want.
> 
> ... technically we could send NBD_OPT_INFO and a final
> NBD_OPT_SERVER.

Well, if we reuse NBD_REP_SERVER, we HAVE to send a name, and we're
STILL stuck with the fact that it would be different for NBD_OPT_LIST
than for NBD_OPT_GO.  But the idea of having NBD_OPT_GO return
_multiple_ replies (zero or more NBD_REP_INFO with one item each, and a
mandatory final NBD_REP_FOO for the final transmission flags) is similar
to how NBD_OPT_LIST returns multiple replies, so there's precedent for
that. It also makes it a bit easier so that we're not repeating lengths
or having to check consistency over a list of lengths.

> 
>> Design decision: a client that wants to learn block sizes MUST
>> use NBD_OPT_GO, rather than the old NBD_OPT_EXPORT_NAME, even
>> though we could have repurposed some of the reserved zeroes when
>> NBD_FLAG_C_NO_ZEROES is not in effect, because we don't want to
>> encourage any further abuse of NBD_OPT_EXPORT_NAME.
> 
> Agree with the premise.
> 
> Is there anything to stop a server using NBD_OPT_INFO, then
> NBD_OPT_EXPORTNAME (given the guarantee about things not changing
> between them)? I agree that a server would be well advised for
> every reason under the sun to use NBD_OPT_GO instead if it\s
> already using NBD_OPT_INFO.

Nothing forbidding it in v2, but a client would be stupid to do that (it
no longer benefits from advertised sizing) - maybe I can weasel in a
clause that if the client does not use NBD_OPT_GO, the server MAY
enforce the limits it would have advertised (that is, a server shouldn't
be penalized to have to support a minimum block size of 1 just because
the client is too old to ask for an advertisement).  Or maybe that a
server that wants to support non-default sizing MUST fail
NBD_OPT_EXPORT_NAME (by dropping the connection).

> 
>> +The preferred block size represents the minimum size at which aligned
>> +requests will have efficient I/O, avoiding behaviour such as
>> +read-modify-write.  If advertised, this MUST be a power of 2 at least
>> +as large as the minimum block size, and MUST NOT be larger than
>> +3,355,442 (32M).
> 
> I'm not sure of the origin of the 32M limit or its purpose, but I
> think it's harmless. The reason I'm not sure of its purpose is
> because we already constrain things to be smaller than the maximum
> block size.

I guess a larger preferred block size doesn't hurt, so I could probably
drop the upper limit here - something larger than 32M may upset qemu,
but as long as the maximum size is also equally large, I don't see why a
client and server should be unable to use such large transactions
everywhere.

> 
> I can think of systems with sharded disks where it is conceivable
> they would prefer a larger preferred block size. If this is a problem
> for the client, they can always limit it clientside.

Indeed, using smaller than preferred size won't cause any server errors,
just less than optimal I/O speeds.

> 
>>  The preferred block size MAY be larger than the
>> +export size, in which case the client is unable to utilize the
>> +preferred block size for that export.  If the server does not
>> +advertise a preferred block size, the client MAY assume that a
>> +preferred block size of 65,536 bytes will have reasonable performance.
> 
> Better, "If the server does not advertise a preferred block size,
> the default preferred block size of 65,536 bytes will apply."
> 
> Somewhere else: "If the server does not advertise one or more
> block sizes, and the block sizes explicitly advertised (if any)
> MUST be compatible with the default block sizes used instead of
> those not advertised."
> 
>> +
>> +The maximum block size represents the maximum length that the server
>> +is willing to handle in one request.  If advertised, it MUST be either
>> +an integer multiple of the minimum block size or the value
>> +4,294,967,295
> 
> Probably worth spelling out this is (uint32_t)(-1)
> 
>> for no inherent limit, MUST be at least as large as the
>> +preferred block size, and SHOULD be at least 3,355,442 (32M), but MAY
>> +be something other than a power of 2.  For convenience, the server MAY
>> +advertise a maximum block size that is larger than the export size,
>> +although in that case, the client MUST treat the export size as the
>> +effectual maximum block size.
> 
> s/effectual/effective/
> 
> Shouldn't the effective maximum block size actually be 'export length
> - offset'?

For a given transaction, yes.

> 
>>  If the server does not advertise a
>> +maximum block size, the client SHOULD NOT send requests with a length
>> +larger than 3,355,442 (32M), to minimize the chance of the server
>> +rejecting the request as too large.
> 
> Again, I'd say "a default maximum block size of X shall apply". And
> I'm afraid I think (per Wouter) the number is -1, i.e. no maximum,
> because that's what the current spec is. Even though plenty of us
> have servers that violate this (me included).

That's why I'm trying to write this as 'client SHOULD NOT exceed a sane
size (to avoid tickling buggy servers), while servers MUST support
unlimited if they did not advertise (new servers will therefore always
advertise)', where the restriction on the client is on the low end but
the client can risk going larger without violating spec, and where the
restriction on the server is on the high end and the server is buggy for
going lower without telling the client.

>>
>> +Where a transmission request can have a nonzero
> 
> s/nonzero/non-zero/ ?
> 
>> *offset* and/or
>> +*length* (such as `NBD_CMD_READ`, `NBD_CMD_WRITE`, or `NBD_CMD_TRIM`),
>> +the client MUST ensure that *offset* and *length* are integer
>> +multiples of any advertised minimum block size, and SHOULD use integer
>> +multiples of any advertised preferred block size where possible.  For
>> +those requests, the client MUST NOT use a *length* of zero, and MUST
>> +NOT use a *length* larger than any advertised maximum block size or
>> +which, when added to *offset*, would exceed the export size.  The
>> +server SHOULD
> 
> MAY? I see no particular harm in the server handling it. Especially if
> (e.g. on TRIM) it handles it through a NOP.

It may be easy to do a no-op for read, write, and trim on zero length,
but it's also easier to state that the client should not rely on zero
length.  Also, forbidding it makes it possible to later implement
extensions (such as a size == 0 for trim means "trim the entire device
at once, even if the export size is > 4g").

> 
>> report an `EINVAL` error if the client's request is not
>> +aligned to advertised minimum block size boundaries, or is larger than
>> +the advertised maximum block size, although the server MAY instead
>> +disconnect if a large *length* could be deemed as a denial of service
>> +attack.

The qemu behavior of rejecting transactions larger than 32M is in the
category of DoS prevention, which is why I think it is worth documenting
a sane limit that clients SHOULD (not MUST) obey in absence of any more
precise limit from the server.

>>
>> The server SHOULD return `ENOSPC` if it receives a write request
>> -including one or more sectors beyond the size of the device.  It SHOULD
>> -return `EINVAL` if it receives a read or trim request including one or
>> -more sectors beyond the size of the device.  It also SHOULD map the
>> -`EDQUOT` and `EFBIG` errors to `ENOSPC`.  Finally, it SHOULD return
>> -`EPERM` if it receives a write or trim request on a read-only export.
>> +including one or more sectors beyond the size of the device.
> 
> TBH I think `EINVAL` for all offset/length errors would be more appropriate.

ENOSPC was mandatory for over-length errors prior to my patch, and
remains useful.  Unlike POSIX files, which automatically grow, NBD does
not allow for volume resizing (and I don't know if we'll ever want to
add an extension supporting that).  But ENOSPC is a useful error for
controlling thin-provisioning where volumes _can_ be extended on the
fly, so that a client can tell the difference between "my write failed,
but may succeed if I resize and try again" vs. "my write failed because
I did it wrong or because the device is permanently hosed".

What I'm adding here was the EINVAL on alignment errors (since size
errors were already specified), as well as the chance for EINVAL in the
case where maximum block size < export size but where the server wants
to reply with an error rather than disconnect.

> 
>>  It
>> +SHOULD return `EINVAL` if it receives a read or trim request including
>> +one or more sectors beyond the size of the device, or if a read or
>> +write request is not aligned to advertised minimum block sizes. It
>> +also SHOULD map the `EDQUOT` and `EFBIG` errors to `ENOSPC`.
> 
> These error codes seem very linux / implementation specific. Also
> I find a read returning an ENOSPC error a bit weird. lseek, readv
> etc. all return EINVAL in such circumstances. ENOSPC is normally
> for operations which would take up more space.

Pre-existing, I didn't change it.

> 
>>  Finally,
>> +it SHOULD return `EPERM` if it receives a write or trim request on a
>> +read-only export.
> 
> s/SHOULD/MUST/?

Pre-existing, I didn't change it.

> 
>>
>> The server SHOULD return `EINVAL` if it receives an unknown command.
> 
> hmmm, then again, EINVAL seems overloaded.

In general, a client that gets an EINVAL should be considered a client
bug ("you should have known better to not pass me invalid arguments") -
yes, we use it in a lot of situations, but I think we are still being
consistent about it being something that the client should be able to
completely avoid.

>> +* `NBD_REP_INFO`
>> +
>> +    A detailed description of an export, subdivided into one or more
>> +    items of information, used by `NBD_OPT_INFO` and `NBD_OPT_GO` on
>> +    success, and which provides a superset of the information given in
>> +    reply to `NBD_OPT_EXPORT_NAME`.  The header *length* MUST be at
>> +    least 14, and MUST match the length visited by reading items in
>> +    turn (that is, 4 * number of items + sum of *info length* from
>> +    each item).
> 
> It would be easier to program this if there wasn't a total length,
> just individual items, and the last one was a NBD_INFO_DONE sentinel,
> or perhaps had a bit set on its type. Otherwise you need to find the
> length of ALL your strings etc. in advance, which is painful. A
> client is going to want to skip these one by one anyway (as it
> needs to interpret them to see whether it understands each one)
> so a length at the start gives no advantage.

In other words, make the reply to NBD_OPT_GO be a series of NBD_REP_
followed by a final sentinel reply, similar to how NBD_OPT_LIST works.
I can try that for v3.

>> +    The following info types are defined:
>> +
>> +    * `NBD_INFO_SIZE_FLAG` (0)
>> +
>> +      The *info length* MUST be 10, and represents the same
>> +      information as given in reply to `NBD_OPT_EXPORT_NAME` as if
>> +      `NBD_FLAG_C_NO_ZEROES` were negotiated.
>> +
>> +      - 64 bits, size of the export in bytes (unsigned)  
>> +      - 16 bits, transmission flags
> 
> I vote for breaking these up into two separate options. In the
> current nbd-server implementation, finding the export length
> is "hard", and finding the transmission flags easy(er).
> There seems no reason to keep them together.
> 
> Should sending transmission flags be compulsory?

During NBD_OPT_INFO, it makes sense to be compulsory because it is the
way the server tells the client whether the export is read-only.  During
NBD_OPT_GO, I think it MUST be provided, because it is replacing the
information that NBD_OPT_EXPORT_NAME provided.

> 
> Should sending export size be compulsory?

During NBD_OPT_INFO, it could perhaps be optional, but for NBD_OPT_GO,
it is mandatory because it replaces NBD_OPT_EXPORT_NAME.  But then we
have the wording about NBD_OPT_INFO followed by NBD_OPT_GO being
required to return the same bits of information, so it's easier to make
it mandatory in both cases.  And layout-wise, it's easier if the final
10 bytes for either NBD_OPT_GO or for NBD_OPT_EXPORT_NAME are identical.

>> +    * `NBD_INFO_MAXIMUM_BLOCK` (4)
>> +
>> +      The *info length* MUST be 4, and represents the maximum block
>> +      size.  See the "Block sizes" section for further requirements on
>> +      its value.
>> +
>> +      - 32 bits, maximum block size  
> 
> I like these.

Is it worth keeping the three separate (where the server can advertise
one, but not all three, and the others fall back to defaults), or is it
easier to just always require all three to be present (since we've
documented sane defaults, a server should always be able to supply
something)?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature

Reply to:

Follow-Ups:
- Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
  - From: Alex Bligh <alex@...872...>

References:
- [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
  - From: Eric Blake <eblake@...696...>
- Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
  - From: Alex Bligh <alex@...872...>

Prev by Date: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Next by Date: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Previous by thread: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Next by thread: Re: [Nbd] [PATCH v2] doc: Add new NBD_REP_INFO reply, for advertising block size
Index(es):
- Date
- Thread