[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: [Nbd] [PATCH] nbd: Support FUA, FLUSH and ROTATIONAL



Paul,

--On 2 August 2011 11:58:11 -0400 Paul Clements <paul.clements@...856...> wrote:

Finally, I've gotten around to this...my comments below...

Thanks for looking at this - replies inline below.

if
(nbd_cmd(req) == NBD_CMD_FLUSH) {
+               request.from = 0;
+               request.len = 0;

I don't believe we can assume that a FLUSH always has zero length. At
least in the block layer sources it says it is valid to have a FLUSH
and/or FUA with a single bio attached. I think nbd needs to handle
that.

On bio based drivers this is true, but I think we can make
this assumption on request based drivers. From the documentation
at
 Documentation/block/writeback_cache_control.txt

| Note that
| REQ_FLUSH requests with a payload are automatically turned into a sequence
| of an empty REQ_FLUSH request followed by the actual write by the block
| layer.

Confusingly, as far as I can make out, the documentation calls bio
based drivers "make_request_function" and request based drivers
"request_fn" based drivers. I believe I verified this point with
Christoph Hellwig on fs-devel but I may be wrong.

The same thing seems to be documented at the top of blk-flush.c
in the sources:

|  * If the device has writeback cache and supports FUA, REQ_FLUSH is
|  * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
|  *
|  * If the device has writeback cache and doesn't support FUA, REQ_FLUSH is
|  * translated to PREFLUSH and REQ_FUA to POSTFLUSH.

I am far from an expert in the code, but I think it is
a blk_insert_flush which clears REQ_FLUSH.

rotational can already be toggled by the block layer sysfs entries...I
don't know that we need an nbd-specific way to do this...

Is this a "why do we need a negotiation" question or a "why do
we change the flag this way" question? The server is saying
"my backing device is rotational or nonrotational",
and the protocol is thus being used to set the flag correctly on
the client, so the server can't set the flag directly (wrong machine).

I think you may be suggesting that the client userspace code
should set the relevant sysfs entry at the time? I had assumed
this was read-only but having just tested, it's not. I see
no issue with doing it this way. Indeed this is superior because
it means the rotational attribute can be supported in the
client without a kernel change.

       case NBD_SET_SIZE_BLOCKS:
               lo->bytesize = ((u64) arg) * lo->blksize;
               bdev->bd_inode->i_size = lo->bytesize;
@@ -775,6 +820,11 @@ static int __init nbd_init(void)
                       put_disk(disk);
                       goto out;
               }
+               /* In order not to confuse the elevator, say we
always +                * want FLUSH and FUA. We won't send them
to the server +                * unless the relevant flag bit is
set
+                */
+               blk_queue_flush(disk->queue, REQ_FLUSH | REQ_FUA);

Not sure this is kosher either.

The reason for this bit of strangeness is that I did not know when
it is legal to change the REQ_FLUSH and REQ_FUA bits using
blk_queue_flush. The negotiation (and thus the ioctl) comes later,
and /conceivably/ there might already be data in the elevator, and
I didn't know the effect of calling blk_queue_flush (which confusingly
changes flushing strategy rather than flushing the queue) if there
was already data in the elevator. It might be that we can guarantee
there is no data in the elevator by then (somehow) in which case
the easy thing would be to call blk_queue_flush only at negotiation
time.

This seems like the best way to fix this if what you say below
is true. If I had complete freedom I'd set the queue strategy
by default somewhere earlier on (rather than in the SET_SIZE_BLOCKS
ioctl call, which is a bit strange), then the queue strategy
when SET_FLAGS is called (or a new sysfs entry as per below).

If you say that you implement FUA at
least, you have to honor the FUA requests.

OK. I'm not saying you are wrong here but the impression Christoph
gave me was that filing systems blindly ask for flush and fua, and
it's up to the block layer whether they implement them, rather than
them acting on the information and changing the order of what they
send. This sort of fits as without flush and fua, the bios can
be completely disordered by the driver anyway (so why not code
your file system to send stuff as you'd want to if flush and
fua are available?).

Otherwise, you're
invalidating the assumptions that filesystems make about data
consistency. I think we could eliminate the FUA handling bit, since
we're not really doing FUA (the nbd-server is just doing a postflush).

Only because of the way it is implemented at the moment. According
to Christoph, the way to do FUA properly is to have a second fd
open with O_DIRECT set, and pass FUA requests to that. I just haven't
coded that yet. It didn't seem worth it unless and until I could
get changes into the kernel.

Also, I have another server that does implement FUA properly :-)

It looks like we basically get the same behavior for less work if we
just don't say that nbd handles FUA and let the block layer convert
the FUA to a postflush. If you turn off FUA do you get 2 flushes
instead?

Christoph's answer to the danger of 2 flushes was "the second
flush will do nothing anyway". But I would still like to support
FUA if possible.

The flag changes look good. I'll go ahead and start testing with those.

Thanks.

One question to the general audience: do we need an ioctl for the flag
setting or would a sysfs entry be sufficient? I really would like to
move away from having so many ioctls in nbd. I've started converting
some debug and ioctl stuff to sysfs already...

My 2 penniworth here is that I'd be fine to use an additional sysfs
entry (or a pair, one for FLUSH and one for FUA) and use the existing
sysfs entry for rotational. However, if we have a well defined flags
word, a single ioctl to carry all the flags possible does not seem
like an enormous amount of ioctl wastage. A more general issue is
that we have poor negotiation of capabilities (other than bits)
between server and client; if we are sending values other than bits,
it may be that an ioctl is suboptimal.

Currently I am using the new ioctl for ROTATIONAL (which can go)
NBD_FLAG_READ_ONLY (which is broken in current kernel implementations,
fixed with my patch but could be fixed another way), and
the FUA/FLUSH stuff (which is non-existent in current kernels).
Sadly I can't seem to get the /sys entry for "ro" to be writeable:

# cat /sys/block/nbd9/ro
0
# echo 1 > /sys/block/nbd9/ro
bash: /sys/block/nbd9/ro: Permission denied

so we need some way to pass through to the kernel the "readonly"
flag I think, at least (as far as I can tell it was broken before
my patch). That could be an nbd specific /sys entry, of course.

The main reason for doing an ioctl is I have no idea how to do
per-disk driver dependent sys entries! This is probably not a good
reason.

--
Alex Bligh



Reply to: