
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Alex Bligh <alex@...872...> writes:

> Goswin,
>
> --On 28 May 2011 16:37:12 +0200 Goswin von Brederlow
> <goswin-v-b@...186...> wrote:
>
>> 2) Overlapping requests
>>
>> I assume that requests may overlap. For example a client may write a
>> block of data and read it again before the write was ACKed. This would
>> be unexpected behaviour from a proper client but not forbidden.
>
> Correct
>
>> As such
>> the server has to internally ensure the proper order of overlapping
>> requests.
>
> Slightly surprisingly, the fsdevel folk's answer to this is that you
> can disorder both reads and writes and do what is natural, i.e. do
> not maintain ordering. A file system which cares about the result
> should not issue reads of blocks for which the writes have not
> completed.

I guess this makes sense if you think of the behaviour with multiple
CPUs and threads. The threads might invoke read/write calls at the same
time. Allowing disorder means that the requests can be processed in
parallel through all the layers without having to synchronize between
CPUs.

Wouter: Could we make a decision here about the behaviour of a correct
nbd-server in this? Must it logically preserve the order of read/write
requests (i.e. return the value you would get if they had been done in
order), or can it implement the disordered behaviour that Linux seems
to allow?

The latter would be much simpler code-wise.
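
To illustrate what I mean by simpler: with disordering allowed, every
request can be handled completely on its own. A rough sketch (the struct
and names are invented, nothing here is taken from the current server):

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request record; none of these names come from nbd-server. */
struct nbd_req {
    int    export_fd;   /* open fd of the export file */
    int    is_write;
    off_t  offset;
    size_t len;
    char  *buf;
};

/* Each request is handled completely on its own: no ordering or locking
 * against other requests, matching the "disordered" behaviour above.
 * A client that cares must not read a block whose write is still unACKed. */
static ssize_t handle_request(const struct nbd_req *r)
{
    if (r->is_write)
        return pwrite(r->export_fd, r->buf, r->len, r->offset);
    return pread(r->export_fd, r->buf, r->len, r->offset);
}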

>> 3) Timing of replies and behaviour details
>>
>> Now this is the big one. When should the server reply to a request and
>> how should it behave in detail? Most important is the barrier question
>> on FUA/FLUSH.
>>
>> * NBD_CMD_READ: reply when it has the data, no choice there
>
> Technically you need not reply as soon as you have data, but you
> can't reply before.

True. And a read reply takes time (lots of data to send). In case there
are multiple replies pending it would make sense to order them so that
FUA/FLUSH replies get priority, I think. After that I think all read
replies should go out in order of their request (oldest first) and write
replies last, the reason being that something will be waiting for the
read while the writes are likely cached. On the other hand, write
replies are tiny and sending them first gets them out of the way and
clears up dirty pages on the client side faster. That might be
beneficial too.

What do you think?
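
As a sketch of that ordering (the struct and its fields are invented
for illustration, not how nbd-server keeps its state), the priority
could be expressed as a plain comparator over the pending replies:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical pending-reply record; 'kind' and 'req_seq' are assumptions,
 * not fields of the real nbd-server. */
enum reply_kind { REPLY_FLUSH_FUA = 0, REPLY_READ = 1, REPLY_WRITE = 2 };

struct pending_reply {
    enum reply_kind kind;
    uint64_t req_seq;          /* arrival order of the original request */
};

/* FLUSH/FUA replies first, then reads oldest-first, then write ACKs last. */
static int reply_cmp(const void *a, const void *b)
{
    const struct pending_reply *x = a, *y = b;
    if (x->kind != y->kind)
        return (int)x->kind - (int)y->kind;
    return (x->req_seq > y->req_seq) - (x->req_seq < y->req_seq);
}

int main(void)
{
    struct pending_reply q[] = {
        { REPLY_WRITE, 1 }, { REPLY_READ, 3 },
        { REPLY_FLUSH_FUA, 4 }, { REPLY_READ, 2 },
    };
    qsort(q, 4, sizeof q[0], reply_cmp);
    for (int i = 0; i < 4; i++)
        printf("kind=%d seq=%llu\n", q[i].kind,
               (unsigned long long)q[i].req_seq);
    return 0;
}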

>> * NBD_CMD_WRITE with NBD_FLAG_SEND_FUA:
>>   + NBD_CMD_FLAG_FUA:
>>     reply when data is on physical medium (sync_file_range(),
>>     fdatasync() or fsync() returned)
>
> (don't use sync_file_range() - see the comment in the source)

I don't get the bit about

   *  a) have a volatile write cache in your disk (e.g. any normal SATA disk)

Isn't that a major bug in sync_file_range() then? How does f(data)sync()
ensure the data is on the physical disk while sync_file_range() returns
once the data has merely been transmitted to the disk?

Is that because f(data)sync() will cause a FLUSH+FUA pair on the
underlying FS (or FLUSH on a device)?
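
For what it's worth, the FUA path I have in mind looks like this sketch
(handle_write() is made up): pwrite() the data and hold the ACK back
until fdatasync() has returned, since fdatasync() also asks the drive to
flush its volatile cache (at least on filesystems that issue barriers),
which sync_file_range() never does.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical write path: pwrite() the payload, and for FUA requests do
 * not ACK until fdatasync() has returned. sync_file_range() is avoided
 * because it never asks the drive to flush its volatile write cache. */
static int handle_write(int export_fd, const char *buf, size_t len,
                        off_t offset, int fua)
{
    while (len > 0) {
        ssize_t n = pwrite(export_fd, buf, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        offset += n;
        len -= (size_t)n;
    }
    if (fua && fdatasync(export_fd) < 0)
        return -1;
    return 0;                   /* now it is safe to send the ACK */
}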

>>     Does this act as a barrier? Should the server stop processing
>>     further requests until the FUA has completed?
>
> No, it should not act as a barrier. You may disorder requests across
> FUA. No you do not need to stop processing requests.
>
> See Documentation/block/writeback_cache_control.txt in the linux
> kernel sources.
>
>>   + not NBD_CMD_FLAG_FUA:
>>     a) reply when the data has been received
>>     b) reply when the data has been committed to cache (write() returned)
>>     c) reply when the data has been committed to physical medium
>
> You may do any of those. Provided you will write the data "eventually"
> (i.e. when you receive a REQ_FLUSH or a disconnect).
>
>>     For a+b how does one report write errors that only appear after
>>     the reply? Report them in the next FLUSH request?
>
> You don't. To be safe, I'd error every write (i.e. turn the medium
> read only).

Why not return EIO on the next FLUSH? If I return success on the next
FLUSH that would make the client think the write has successfully
migrated to the physical medium, which would not be true.

And turning the device read-only seems like a bad idea. How would
badblocks be able to work then?
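
Just to make the idea concrete, here is a sketch of what I mean by
reporting it on the next FLUSH (the latch is purely hypothetical,
nothing like it exists in the server today):

#include <errno.h>
#include <unistd.h>

/* Hypothetical: errno latched from a write that failed after its ACK
 * had already been sent. */
static int deferred_write_error = 0;

static void note_late_write_error(int err)
{
    if (deferred_write_error == 0)
        deferred_write_error = err;
}

/* The latched error makes the next FLUSH fail, so the client never
 * believes the lost data reached the physical medium. */
static int handle_flush(int export_fd)
{
    if (deferred_write_error != 0) {
        errno = deferred_write_error;
        deferred_write_error = 0;
        return -1;              /* reply to the client with EIO */
    }
    return fdatasync(export_fd);
}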

>>     The current behaviour is b I think
>
> correct, but this is indistinguishable from a client point of view from
> (a).

It makes a difference if the nbd-server process dies but the system
itself doesn't crash. With b the data will still end up on disk. Slight
difference. But I guess in both cases the power can fail and lose the
data, and clients have to assume that might happen.

>>     unless the server runs with the
>>     sync option and then it is c. Is option a valid? In COW mode there
>>     would be little sense in waiting for the write to complete since all
>>     data is lost on reconnect anyway and this might be a tick faster.
>
> CoW+synchronous writes are in general not useful unless you have
> something else examining the CoW file.
>
>> * NBD_CMD_WRITE without NBD_FLAG_SEND_FUA:
>>   I assume the behaviour is the same as write with NBD_CMD_FLAG_FUA not
>>   set in the above case.
>
> The semantics are the same as if you had set NBD_FLAG_SEND_FUA; no
> incoming requests will have the FUA bit set.
>
>>   [I wonder if there should be a NBD_FLAG_SYNC although that would be
>>   identical to the client setting NBD_CMD_FLAG_FUA on all writes.]
>
> There's not much point the server asking the client to set a bit on
> every request, because it can set it itself; it's /practically/ identical
> to FUA on everything (I think there may be a difference in metadata,
> i.e. syncing mtime).

What happens if the client mounts a filesystem with the sync option?
Does nbd then get every request with FUA set?

It might make sense to implement an NBD_FLAG_SYNC option that the
nbd-client can set in the handshake to force an export into sync mode.

>>   I'm considering having an --auto-sync mode that runs in sync mode
>>   when NBD_FLAG_SEND_FUA and NBD_FLAG_SEND_FLUSH is not used. This would
>>   be slow for old clients and fast for new ones but ensure data
>>   integrity for both. Maybe something nbd-server could use too.
>
> Note that even if you set NBD_FLAG_SEND_{FUA,FLUSH} you won't actually
> get FUA / FLUSH unless:
>
> a) the client supports it (because the client has to do a SETFLAGS
>   ioctl)

No. The --auto-sync mode would take care of exactly that. If the client
does not support FUA/FLUSH then sync would be activated.

> b) the kernel supports it

That is something the client should check before telling the server it
will do FUA/FLUSH.

> c) the filing system supports it.
>
> Because of (c), you won't get any on ext2, or (depending on distribution)
> ext3, unless you mount with -obarriers=1 (some distros do this by default).
> ext4 will send them by default, as does btrfs and xfs.

Nothing one can do there. If the FS doesn't operate safely then it
won't get safe behaviour from the nbd-server, just as with a physical
disk. I just don't want to make it less safe.

> Therefore, even if you know (a) and (b) are satisfied, you should
> at least attempt to be safe if you never receive a single flush.

Hmm, maybe work in sync mode until the first FLUSH is received? That
might actually be better.
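
Something like this sketch is what I have in mind for --auto-sync (the
names are invented): every write is synced until the client has sent
its first FLUSH.

#include <stdbool.h>
#include <unistd.h>

/* Hypothetical --auto-sync state: true once the client has sent its first
 * NBD_CMD_FLUSH, i.e. it has proven it knows how to flush when it cares. */
static bool client_flushes = false;

static int complete_write(int export_fd)
{
    if (!client_flushes)
        return fdatasync(export_fd);   /* behave as if the sync option were set */
    return 0;                          /* ACK now, data reaches disk on FLUSH */
}

static int complete_flush(int export_fd)
{
    client_flushes = true;
    return fdatasync(export_fd);
}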

>> * NBD_CMD_DISC: Wait for all pending requests to finish, close socket
>
> You should reply to all pending requests prior to closing the socket
> I believe, mostly as it's polite. I believe the current client doesn't
> send a disconnect until all replies are in, and I also think the server
> may behave a little badly here.

finished == ACK sent.

>>   Should this flush data before closing the socket? And if so what if
>>   there is an error on flush? I guess clients should send NBD_CMD_FLUSH
>>   prior to NBD_CMD_DISC if they care.
>
> No, you should not rely on this happening. Even umount of an ext2 volume
> will not send NBD_FLUSH where kernel, client, and server support it.
> You don't need to write it then and there (in fact there is no 'then
> and there' as an NBD_CMD_DISC has no reply), but you cannot guarantee
> *at all* that you will have received any sort of flush under any
> circumstances.

I meant that if the client cares about the data being written before a
disconnect it has to explicitly flush. If it doesn't flush then that is
its problem.

>>   What if there are more requests after this while waiting for pending
>>   requests to finish? Should they be ignored or return an error?
>
> I believe it is an, um, undocumented implicit assumption that no
> commands are sent after NBD_CMD_DISC is sent. The current server
> just closes the socket, which will probably result in an EPIPE
> upstream if the FIN packet gets back before these other commands
> are written.

I will go with ignore then, for the case that any request might arrive
after NBD_CMD_DISC but before the reading side of the socket is shut
down. And close the writing side of the socket once all pending requests
have been ACKed.
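
As a sketch of that shutdown sequence (wait_for_pending_replies() is a
hypothetical helper, not an existing function):

#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: blocks until every pending request has been ACKed. */
extern void wait_for_pending_replies(void);

/* NBD_CMD_DISC: stop reading (so late requests are silently ignored),
 * finish ACKing what is already in flight, then half-close and close. */
static void handle_disconnect(int sock)
{
    shutdown(sock, SHUT_RD);
    wait_for_pending_replies();
    shutdown(sock, SHUT_WR);
    close(sock);
}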

>> * NBD_CMD_FLUSH: Wait for all pending requests to finish, flush data
>>   (sync_file_range() on the whole file, fdatasync() or fsync() returned).
>
> You only need to wait until any writes that you have sent replies
> for have been flushed to disk. It may be easier to write more than
> that (which is fine). Whilst you do not *have* to flush any commands
> issued (but not completed) before the REQ_FLUSH, Jan Kara says "*please*
> don't do that".

Urgs. Assuming one doesn't flush commands that were issued but not yet
completed, how then should the client force those to disk? Sleep until
they happen to be ACKed and only then flush?

I agree with Jan Kara there. Please let's not do that. Let's document
(and implement) that a FLUSH command on nbd means all prior requests are
to be flushed, not just those already ACKed.
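
In code that would amount to something like this sketch (the in-flight
accounting helper is hypothetical): the FLUSH reply only goes out once
every earlier write has hit the file and fdatasync() has returned.

#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: blocks until every write received before the
 * request with sequence number 'seq' has completed its pwrite(). */
extern void wait_for_writes_before(uint64_t seq);

/* FLUSH covers every prior write, ACKed or not: wait for them to reach
 * the file, then force the file (and the drive cache) to stable storage. */
static int handle_flush_all_prior(int export_fd, uint64_t flush_seq)
{
    wait_for_writes_before(flush_seq);
    return fdatasync(export_fd);
}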

>>   I assume any delayed write errors should be returned here.
>
> No. There is really no such thing as a "delayed write error". Consider
> what happens with the current server. We write to the block cache.
> That hits physical disk some time later. If the physical disk errors,
> the normal fs response is for the disk to go read-only. You are right
> that flush will then (probably) error, but so will every other write
> command. Basically, you are stuffed if you get an error.

NBD is a block device. So you have to look at what a local block device
will do. From userspace I get the following:

write() -> data goes into cache, returns OK
fsync() -> data goes to the physical medium, returns EIO on error
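
As a minimal userspace illustration of that split (just an example
program, not server code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Usually succeeds: the data only has to reach the page cache. */
    if (write(fd, "hello", 5) != 5)
        perror("write");

    /* This is where a medium error would normally surface as EIO. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}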

Sadly the chapter on the Block I/O Layer in "Linux Kernel Development"
is rather sparse, with only 15 pages and half of them describing I/O
schedulers. So please excuse my ignorance.

How do filesystems even get the block number of a physical disk error?
My understanding is that a bio write goes into the disk's cache and
completes. Later a REQ_FLUSH is sent to the disk to force the data onto
the platter. Doesn't the disk only report an error on that REQ_FLUSH
when it can't write a block?

>>   Does this act as a barrier? Should the server stop processing
>>   further requests until the FLUSH has completed?
>
> No, REQ_FLUSH is not (in general) a barrier. You can disorder writes
> across it (i.e. you can start and complete writes issued after
> the REQ_FLUSH before you have completed the REQ_FLUSH, *and* you
> can avoid completing writes issued before the REQ_FLUSH when you
> complete the REQ_FLUSH - save for Jan's imprecations above). There
> are no longer barriers within the request system since they got
> ripped out a while ago.
>
>>   When using write(), implementing this as a barrier is probably
>>   easiest. But when using async writes (libaio or POSIX aio) one would
>>   just insert an aio_fsync() into the queue and reply on completion.
>
> There's no harm implementing it as a barrier, but it's more than
> is necessary. Assuming your writes are all written to a file
> when completed, all you need to do is fsync().

What is the behaviour when doing an fsync() in one thread followed by a
write() in a second thread (while fsync() still runs)?

Say you have:

Thread 1                Thread 2
                        write 0 to block
                        fsync
write 1 to block
fsync
                        write 2 to block
fsync returns

Is it guaranteed that the block now contains 1 or 2, or could the last
write cancel the second as unnecessary but not sync the third? Could one
end up with the block still containing 0? I'm assuming it can't, but
let's be sure.

> For more detail, read this thread:
>  http://www.spinics.net/lists/linux-fsdevel/msg45584.html
> as I appear to have asked the same questions as you (there was
> a bit of off-list stuff too).
>
> In this I am assuming that REQ_FLUSH and REQ_FUA semantics are defined
> by the linux kernel semantics for them. I think I can safely assume
> that as it was me who added them so I guess I get to decide :-)

Yes, that is my assumption too.

Regards,
        Goswin


