[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Allowing > 32 bit lengths for NBD_CMD_TRIM, NBD_CMD_WRITE_ZEROES

Hi Rich,

On Tue, Sep 04, 2018 at 09:47:13AM +0100, Richard W.M. Jones wrote:
> If you create a very large XFS filesystem, XFS issues a single discard
> request over the whole filesystem first.  However NBD translates this
> into many 4 GB NBD_CMD_TRIM requests over the wire.
> I've been trying to create an 8 EB (2^63 - 1024) XFS filesystem to
> explore the limits of the Linux kernel, XFS, NBD, nbdkit.  The discard
> step issues 2 billion x 4 GB NBD_CMD_TRIM requests which takes ...
> quite a while (actually too long, I had to turn this feature off in
> mkfs.xfs).
> Since NBD_CMD_TRIM & NBD_CMD_WRITE_ZEROES don't involve sending data
> over the wire, it seems we should allow larger lengths for these
> commands as an extension.

Right. This was brought up before, but I think it got burrowed in
various other things we were discussing at the time.

> I could think of a few ways this might be implemented:
> (1) At option negotiation time, negotiate a "64 bit lengths" extension
> and change the format of the request packet.  I don't like this: it
> would break a lot of packet analysis tools and also makes parsing
> commands in the server more difficult.
> (2) Add a command flag NBD_CMD_FLAG_64_BIT_LENGTH.  When this is
> present, the high order 32 bits of the length are sent by the client
> after the normal request packet.  This raises the question of if we
> should allow this for NBD_CMD_WRITE/NBD_CMD_READ too.

Currently, the protocol message format is the same for *all* messages.
In my opinion, that's a feature, since it makes reading protocol
messages simple; the code that reads a message from the wire only needs
to know whether data will follow, it doesn't need to know anything else
about a message. For nbd-server, that's in mainloop_threaded
(nbd-server.c:2745 -- yes that needs to be split up ;-)

This proposal (and (3) below, too) breaks that; the code which reads a
message from the wire would need to know a lot more about each message

Something similar which I suggested earlier and which wouldn't break
that assumption was a flag which would shift the result: if the flag is
set, you send the high-order bits, and we lose the low-order ones. That
would mean that if you want to send a request for something that is not
32-bit aligned, you may need to send up to three commands: one without
the flag for the area up to the next multiple of 2^32, one with the flag
for "most" of the area, and then a final one for the area from the last
multiple of 2^32 in your range up to the end of the range. 3 is much
better than 2 billion though.

I'm not a huge fan of this, because it's a stopgap measure rather than a
proper fix, but it would work, and it wouldn't need to change messages
incompatibly. As a result it would also be much easier to implement (I
*still* haven't gotten around to implementing structured replies...)

> (3) Add a new set of commands (NBD_CMD_TRIM64 etc) which work exactly
> like the existing commands except the high order 32 bits of the length
> are sent after the normal request packet.

Same here. Additionally, there's really little difference between having
an NBD_CMD_TRIM with a SIZE_64 flag and a separate NBD_CMD_TRIM64
command (indeed, originally the "flag" and "command" fields were one and
the same). While I'm no fan of either, if forced to choose I'd lean more
towards the former, but I don't have a very strong opinion on that.

> (4) Allow a different request message format, differentiated by using
> a different magic (! NBD_REQUEST_MAGIC).  We could use the opportunity
> to widen a few fields and reserve space for future expansion.
> Similar to (1).

I was originally opposed to this, but have since come to realize that
that (being opposed) may have been a mistake. We probably should have
used the opportunity to change the request format when we changed the
response format, too.

It's not too late to introduce a request format change, though. In that
case, I would suggest:

- magic (32 bit value): different magic number, indeed.
- request length (32 bit value): length of header and data (I don't
  think it makes sense to have requests that are >4GiB, but could be
  persuaded otherwise).
- header length (32 bit value). 8 bits should be more than enough, but
  for alignment reasons 32 bits is probably nicer. Or we can say "3
  reserved bytes followed by 8 bytes of header length". Or maybe the
  other way around, so the reserved bytes are immediately preceding the
- flags (16 bit value): same as today
- command (16 bit value): same as today
- from (64 bits): same as today
- affected length (64 bits): For a write command, affected length should
  be equal to request length - header length. For other commands,
  affected length should just signal the length of the desired response
  (for read) or of the desired change (for trim, zero, etc).

So a request to zero out 2^48 bits of data would be:

- 32 bits magic
- request length: (3*4 + 2*2 + 2*8) = 32
- header length: same, 32
- flags: probably emtpy
- command: NBD_CMD_WRITE_ZEROES (4)
- from: (whatever offset)
- affected length: 2^48

Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008

Reply to: