Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension

To: Eric Blake <eblake@...696...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>
Subject: Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension
From: Alex Bligh <alex@...872...>
Date: Tue, 12 Apr 2016 15:39:41 +0100
Message-id: <59520F61-394C-48AB-8F6A-958DBFDEABC5@...872...>
In-reply-to: <570D04BC.8020900@...696...>
References: <1460433845-7865-1-git-send-email-eblake@...696...> <017B7506-C552-4ABD-B3D4-DE9C892B05CC@...872...> <570D04BC.8020900@...696...>

On 12 Apr 2016, at 15:22, Eric Blake <eblake@...696...> wrote:

> In other words, you're proposing that we limit the advertisement of
> block size ONLY via NBD_CMD_INFO/GO, so that gives people a reason to
> upgrade. I can buy that, if Wouter likes it.

Yes

> I think it's probably time to promote NBD_CMD_INFO/GO to stable, since
> we have several implementations coded up (I'm too late for qemu 2.6 to
> contain it, but qemu 2.7 will have the code merged in as soon as 2.6 is
> released) - you already mentioned that you looked at what it would take
> to get the reference implementation to also support it (with the tricky
> part being how to get the export size).

Save that I'm proposing extending it :-) But it should be a future-proof
extension.

>> Perhaps instead of the name (before the other bit that looks
>> like NBD_OPT_EXPORT_NAME's reply), we should *now* specify
>> a structured selection of export information, e.g.
>> 
>> 0000: 32 bits: information element length
>> 0004: 32 bits: information element type
>> 0008 ... length: information element data
> 
> If we're going to send something other than the name, then it is no
> longer NBD_REP_SERVER, and we'd be better off inventing a new NBD_REP_.
> But if we keep it NBD_REP_SERVER (start with name like always, and end
> with transmission flags like NBD_OPT_EXPORT_NAME), we can still make the
> middle as structured as we want, as in this example:
> 
> S: 64 bits magic (0x3e889045565a9)
> S: 32 bits NBD_OPT_INFO (6)
> S: 32 bits NBD_REP_SERVER (2)
> S: 32 bits length
> S: 32 bits name length (3)
> S: name length bytes ("foo")
> S: 16 bits element type NBD_INFO_BLOCK_SIZE (1)
> S: 16 bits element length (12)
> S: 32 bits minimum block size
> S: 32 bits preferred block size
> S: 32 bits maximum block size
> S: 16 bits element type NBD_INFO_...
> S: 16 bits element length ...
> S: length bytes for next element...
> S: 16 bits element type NBD_INFO_END (0)
> S: 64 bits export size
> S: 16 bits transmission flags
> 
> where we can have as many additional information elements defined in the
> middle, and where the client doesn't have to negotiate anything (the
> server just goes ahead and provides everything that it knows, and the
> client is free to use the element types it knows and ignore via the
> length the ones it does not, until reaching NBD_INFO_END).

Yeah I could buy that. I forgot NBD_REP_SERVER is used by NBD_OPT_LIST
too.

>> We can then put what we like in there, and we have a future-proof
>> way of sending mount information.
>> 
>> The name then can be the first of these, i.e.
>> 
>> Element type 0001, Data: length of name (32 bits, followed by name).
>> 
>> We can then transmit the other data as an element type.
> 
> If we reuse NBD_REP_SERVER, then name can't be any different from how
> NBD_OPT_LIST uses it. Otherwise, we want a new NBD_REP_ type (but then,
> we could even make the name optional, as for NBD_OPT_INFO it is only
> useful information if the export has a canonical name that differs from
> the name the user asked about).

Agree.

>>> +
>>> +The minimum block size MUST be a power of 2, and represents the
>>> +smallest addressable length and alignment within the export, even if
>>> +writing a block that small requires the server to use a less-efficient
>>> +read-modify-write action.  This value MAY be as small as 1 if the
>>> +export is backed by a regular file, although the values of 512 or 4096
>>> +are more typical for an export backed by a block device.  When block
>>> +sizes are negotiated, the server MUST return an export size that is an
>>> +integer multiple of the minimum block size.
>> 
>> I think you should make it clear that ALL requests MUST have an
>> offset AND a length which are multiples of the minimum blocksize.
>> Ah, you handle this below, but it might be worth saying here
>> what this means.
>> 
>> +1 on export size.
>> 
>> Should this not be larger than 32M?
> 
> True. In fact, minimum block size should probably not be larger than
> 32k,

Hah yes.

> since the NBD_CMD_FLAG_DF says that we only guarantee that the
> server won't fragment at 32k or smaller, but at the same time, the
> server should not be fragmenting smaller than minimum block size.  If
> the minimum block size is greater than 32k, then the client can't
> usefully set NBD_CMD_FLAG_DF, but also can't usefully guarantee the
> result will be unfragmented.  And while 4k physical blocks are becoming
> more prominent, I seriously doubt we have block devices where minimum
> block size larger than 32k are going to be in heavy use any time soon.
> 
> I still strongly think that export size MUST be an integer multiple of
> minimum block size (otherwise, you have the funky situation of the tail
> of the file being inaccessible via NBD) - perhaps this means we should
> add a statement that if an NBD server will be visiting a file that is
> not a multiple of the minimum block size it is willing to let the client
> use, then the server either has to truncate the export size down (the
> client can't access the last few bytes of the file), or round it up (the
> last sector of the file will have a tail that always reads as zeroes and
> where writes into the tail are ignored), so that the client always sees
> an export size that is a sane multiple.

I agree with that and wasn't saying otherwise. I'm saying it should
present an export size which is an integer multiple of the minimum block
size. How it does this is really up to it. We don't care if it writes
the blocks beneath the wings of carrier pigeons.

>> Obviously they can't send offsets or lengths greater than the export
>> size, but that's handled elsewhere. In practical terms it's
>> easier to get this information from the driver, and leave the export
>> size out of it.
> 
> Fair enough - so the wording should be tweaked to state that preferred
> and maximum sizes MAY be larger than export size, at which point export
> size becomes the effectual limit. I'm also starting to think I should
> add block sizes as a separate sub-heading up front, particularly if we
> DON'T make it possible to get export sizes except by NBD_OPT_GO (that
> is, older clients using NBD_OPT_EXPORT_NAME should still read the text
> about the client SHOULD NOT assume larger than 32M maximum transactions
> without word from the server, because they will be unable to get word
> from the server).

Separating it is not a bad idea.

However, older clients will need to change to get the information anyway,
so I'm not worried about how they change.

As to mandating a 32M maximum transaction for all clients/servers
which do not support the extension, I'm not sure that's reasonable.
There's no maximum now. Technically qemu is 'out of spec' here (though
for entirely sensible reasons). You said it yourself above, that the
limit has to be communicated out of band without the extension.

Not too fussed though.

--
Alex Bligh

Attachment: signature.asc
Description: Message signed with OpenPGP using GPGMail

Reply to:

References:
- [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension
  - From: Eric Blake <eblake@...696...>
- Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension
  - From: Alex Bligh <alex@...872...>
- Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension
  - From: Eric Blake <eblake@...696...>

Prev by Date: Re: [Nbd] [PATCHv2 6/6] Add TLS testing to nbd-tester-client.c
Next by Date: Re: [Nbd] [PATCHv2 5/6] Add TLS support to server
Previous by thread: Re: [Nbd] [RFC PATCH] doc: Add new NBD_FLAG_BLOCK_SIZE extension
Next by thread: [Nbd] [PATCHv9] Improve documentation for TLS
Index(es):
- Date
- Thread