
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Alex Bligh <alex@...872...> writes:

> Goswin,
>
>> Wouter: Could we make a decision here about the behaviour of a correct
>> nbd-server in this? Must it logically preserve the order of read/write
>> requests (i.e. return the value you would get if it had been done in
>> order) or can it implement the disordered behaviour that linux seems to
>> allow?
>>
>> The latter would be much simpler code-wise.
>
> I think we should probably involve the block layer people in this as
> whilst the ability to disorder a read to go before a write appears
> to be technically allowed, I am under the impression it hadn't
> received a lot of testing. Perhaps the answer is that we simply
> don't specify it in the protocol, but say "you must do what the
> linux block layer expects".

That really sucks documentation-wise, because then you have to start a
hunt for further documentation which probably doesn't even exist
outside the source.

It is OK to say we implement what the Linux block layer expects, but
then the text should spell that out, or at least name a file to look
at for the details. It should not be left this vague.

>> True. And a read reply takes time (lots of data to send). In case there
>> are multiple replies pending it would make sense to order them so that
>> FUA/FLUSH get priority I think. After that I think all read replies
>> should go out in order of their request (oldest first) and write replies
>> last. Reason being that something will be waiting for the read while the
>> writes are likely cached. On the other hand write replies are tiny and
>> sending them first gets them out of the way and clears up dirty pages on
>> the client side faster. That might be beneficial too.
>>
>> What do you think?
>
> There's no need to specify that in the protocol. It may be how you choose
> to implement it in your server; but it might not be how I choose to
> implement it in mine. A good example is a RAID0/JBOD server where you
> might choose to split the incoming request queue by underlying physical
> device (and split requests spanning multiple devices into multiple
> requests). Each device's queue could be handled by a separate thread.
> This is perfectly permissible, and there needs to be no ordering between
> the queues.

Obviously. That was purely an implementation question.

I think you also misunderstood me. I didn't mean that incoming requests
should be ordered in this way but that pending outgoing replies should
be.

Say you are currently replying to a large read request and the socket
blocks. While waiting for it some other requests finish and want to
send a reply, so you put all those replies into a queue for later
sending. Now FUA/FLUSH replies could be put at the head of the queue.

But something else you wrote makes this a bad idea. You said that a
FLUSH only ensures that completed requests have been flushed. So if a
FLUSH is ACKed before a WRITE, the client has to assume that the WRITE
wasn't flushed yet and issue another FLUSH. To prevent this, a FLUSH
ACK should come after the ACK of any WRITE it flushed to disk. So
there should be some limit on how many other replies a FUA/FLUSH ACK
can skip.

Maybe it is best to simply send out replies in the order they happen
to finish. Or send them in the order they came in (only those already
waiting to be sent, no extra waiting).
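
To illustrate what I mean, here is a minimal sketch of the "send in
completion order" variant, assuming a non-blocking client socket;
struct reply and send_reply_nonblock() are hypothetical placeholders:

#include <stddef.h>
#include <sys/queue.h>          /* BSD-style TAILQ macros from glibc */

struct reply {
    TAILQ_ENTRY(reply) link;
    /* reply header and, for reads, the payload would live here */
};

/* hypothetical: returns nonzero when the whole reply went out,
   0 if the socket would block */
int send_reply_nonblock(int sock, struct reply *r);

TAILQ_HEAD(reply_fifo, reply);
static struct reply_fifo ready = TAILQ_HEAD_INITIALIZER(ready);

/* Called whenever a request (read, write, flush, ...) finishes. */
void queue_reply(struct reply *r)
{
    TAILQ_INSERT_TAIL(&ready, r, link);   /* plain FIFO, no reordering */
}

/* Called when poll()/epoll says the client socket is writable again. */
void drain_replies(int sock)
{
    struct reply *r;

    while ((r = TAILQ_FIRST(&ready)) != NULL) {
        if (!send_reply_nonblock(sock, r))
            break;                        /* would block: retry later */
        TAILQ_REMOVE(&ready, r, link);
    }
}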

>> I don't get the bit about
>>
>>    *  a) have a volatile write cache in your disk (e.g. any normal SATA
>> disk)
>>
>> Isn't that a major bug in sync_file_range then?
>
> It is one of those cases of a syscall that does not do what you think
> it does. All it does is cause write out to be initiated (and possibly
> wait until the writeout has completed). It does not ensure the writeout
> hits the underlying device.

It should be extended to have a FLUSH option then. In the simple case
that would be the same as fsync(). On a striped RAID or a multi-device
LV it could be reduced to flushing only the required physical devices
instead of all of them.
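
For reference, a minimal sketch of the difference as I understand it,
assuming a plain-file backend: sync_file_range() only starts and waits
for page writeback, while fdatasync() is what actually causes the
device cache flush (on filesystems that issue barriers):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start writeback of the whole file and wait for it.  This does NOT
   flush the disk's volatile write cache. */
int start_writeback(int fd)
{
    return sync_file_range(fd, 0, 0,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* What a FLUSH has to do today: fdatasync() writes back dirty pages
   and also tells the block layer to flush the device cache. */
int flush_to_medium(int fd)
{
    return fdatasync(fd);
}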

>> How does f(data)sync()
>> ensure data is on physical disk while sync_file_range returns when it
>> only transmitted to the disk?
>>
>> Is that because f(data)sync() will cause a FLUSH+FUA pair on the
>> underlying FS (or FLUSH on a device)?
>
> Yes. But it's more than that. If you write to a CoW based filing system,
> fsync (and even fdatasync) will ensure the CoW metadata is also flushed
> to the device, whereas sync_file_range won't. Without the CoW metadata
> being written, the data itself is not really written.

Which just means that a CoW-based filing system or sparse files don't
support FUA. The idea of FUA is that it is cheaper than a FLUSH. But if
nbd-server does fsync() in both cases then it is pointless to announce
FUA support.
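
In other words, with a plain-file backend and only fsync()-style
primitives available, the two cases collapse into the same call. A
hedged sketch of what I mean:

#include <unistd.h>

/* WRITE with FUA: the write itself plus a full sync -- exactly as
   expensive as handling a FLUSH, which is the problem. */
int handle_write_fua(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);
}

/* FLUSH: force everything written so far to the medium. */
int handle_flush(int fd)
{
    return fdatasync(fd);
}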

>>> You don't. To be safe, I'd error every write (i.e. turn the medium
>>> read only).
>>
>> Why not return EIO on the next FLUSH? If I return success on the next
>> FLUSH that would make the client thing the write has successfully
>> migrated to the physical medium. Which would not be true.
>
> Because
> a) there may not be a next FLUSH at all

Then I will never know the write had an error. I only see that on
fsync().

> b) the next FLUSH might come in a long time, and you really don't
>   want the FS to continue writing to a disk that has an error in.

Again, unless I do an fsync() I won't know about it. Even if I fsync()
every second, a lot of writes will potentially have gone by.

> c) the error checking on FLUSH may lose the error.
>
> To be clear, I wasn't saying "don't error the next flush". I am
> saying you should error ALL write-based commands including the next
> flush and every other flush.
>
>> And turning the device read-only seems like a bad idea. How would
>> badblocks be able to work then?
>
> The FS going readonly is exactly what happens if you have an error
> on a SCSI disk. I presume you then detach the drive, reattach and
> run badblocks.

Say you are running "mkfs -t ext4 -c -c /dev/nbd0". Now you hit one
bad block and the device turns itself read-only. That is not the
behaviour you want.

>> What happens if the client mounts a filesystem with sync option? Does
>> nbd then get every request with FUA set?
>
> No. The client sends FUA only when there is FUA on the request. The sync
> option is a server option, and the server merely syncs after every
> request.

It is also a mount option. Mounting a filesystem with sync on a local
disk and on nbd should give the same behaviour.

>> It might make sense to implement an NBD_FLAG_SYNC option that the nbd-client
>> can set in the handshake to force an export into sync mode.
>
> Oh I see, so it can force everything to be synchronous, whether the
> server (or the kernel) want it to or not.

Exactly.

>>>>   I'm considering having an --auto-sync mode that runs in sync mode
>>>>   when NBD_FLAG_SEND_FUA and NBD_FLAG_SEND_FLUSH is not used. This would
>>>>   be slow for old clients and fast for new ones but ensure data
>>>>   integrity for both. Maybe something nbd-server could use too.
>>>
>>> Note that even if you set NBD_FLAG_SEND_{FUA,FLUSH} you won't actually
>>> get FUA / FLUSH unless:
>>>
>>> a) the client supports it (because the client has to do a SETFLAGS
>>>   ioctl)
>>
>> No. The --auto-sync mode would take care of exactly that. If the client
>> does not support FUA/FLUSH then sync would be activated.
>
> If you don't issue the ioctl (which you won't unless the client
> supports it), FLUSH and FUA won't be sent. This is to prevent the
> kernel sending FLUSH and FUA to servers that don't support them.
>
> The other way of doing this is to cause the server to do sync even
> when the sync option is not set. However, you will need client support
> for this too (so it can be negotiated), in which case not starting
> from a base position of a client that supports FLUSH and FUA would
> be perverse.

Hmpf, I thought the server would set NBD_FLAG_SEND_{FUA,FLUSH} and the
client would return whether it uses them. But that isn't the case, as
you write below: the server sets them and the client may or may not
use them. My bad.
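
So, for my own reference, the client side would look roughly like this
sketch (assuming the NBD_SET_FLAGS ioctl and the
NBD_FLAG_SEND_FLUSH/FUA constants from a kernel that carries the
FLUSH/FUA patches; error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/nbd.h>

/* Pass the flags the server advertised during the handshake on to the
   kernel.  Only after this is the kernel *allowed* to send FLUSH/FUA;
   whether it actually does depends on the filesystem on top. */
void pass_flags_to_kernel(int nbd_fd, uint32_t server_flags)
{
    unsigned long kflags = 0;

    if (server_flags & NBD_FLAG_SEND_FLUSH)
        kflags |= NBD_FLAG_SEND_FLUSH;
    if (server_flags & NBD_FLAG_SEND_FUA)
        kflags |= NBD_FLAG_SEND_FUA;

    /* Older kernels reject NBD_SET_FLAGS; in that case FLUSH/FUA will
       simply never be sent. */
    ioctl(nbd_fd, NBD_SET_FLAGS, kflags);
}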

>>> b) the kernel supports it
>>
>> This the client should check before telling the server it will do
>> FUA/FLUSH.
>
> You can check that with the ioctl (it will error if it doesn't
> support SETFLAGS). However, I think you may have the negotiation
> a bit backwards. Currently the server says "I support FLUSH and
> FUA, send them to me if you like". The client does not need to
> tell the server that it will or won't do FLUSH and FUA, and indeed
> has no way of knowing. The best it can do is permit the kernel
> to send FLUSH and FUA, which it does by calling the ioctl.
>
>>> c) the filing system supports it.
>>>
>>> Because of (c), you won't get any on ext2, or (depending on distribution)
>>> ext3, unless you mount with -obarriers=1 (some distros do this by
>>> default). ext4 will send them by default, as does btrfs and xfs.
>>
>> Nothing one can do there. If the FS doesn't work safe then it won't get
>> safe behaviour from the nbd-server. Same as if it had a physical disk. I
>> just don't want to make it less safe.
>
> Sure, but what I'm saying is you can't automagically detect lack of
> FLUSH and FUA (which might be for any number of reasons) and turn
> sync on.
>
>>> Therefore, even if you know (a) and (b) are satisfied, you should
>>> at least attempt to be safe if you never receive a single flush.
>>
>> Hmm, maybe work in sync mode until the first FLUSH is received? That
>> might actually be better.
>
> But I might not want to work in sync mode. I might be quite happy
> with ext2-style behaviour. I think your option of modifying
> the client to request sync mode is far easier, and far better
> than fiddling with the requests as sent by the kernel. The main
> advantage is it requires no kernel patches!

Then you wouldn't run the nbd-server with --auto-sync. This is meant
as a third option between no sync and sync: sync unless FUA/FLUSH is
being used.

Having the client negotiate for sync mode is a different issue. Both
should be possible.
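
As a sketch of what --auto-sync would mean on the server side (with
your caveat that the absence of FLUSH cannot be detected reliably):
fdatasync() after every write until the first FLUSH or FUA request has
been seen, then trust the client.

#include <stdbool.h>
#include <unistd.h>

static bool client_uses_flush = false;

/* Call after every completed write while in auto-sync mode. */
int autosync_after_write(int fd)
{
    if (client_uses_flush)
        return 0;              /* client handles ordering via FLUSH/FUA */
    return fdatasync(fd);      /* sync mode until proven otherwise */
}

/* Call when the first FLUSH or FUA-flagged request arrives. */
void note_flush_or_fua_seen(void)
{
    client_uses_flush = true;
}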

>>> No, you should not rely on this happening. Even umount of an ext2 volume
>>> will not send NBD_FLUSH even where kernel, client, and server support it.
>>> You don't need to write it then and there (in fact there is no 'then
>>> and there' as an NBD_CMD_DISC has no reply), but you cannot guarantee
>>> *at all* that you will have received any sort of flush under any
>>> circumstances.
>>
>> I meant that if the client cares about the data being written before a
>> disconnect it has to explicitly flush. If it doesn't flush then that is
>> his problem.
>
> And what I am saying is that this is not current behaviour. Even with
> today's nbd release and my patched kernel (none of which you are guaranteed)
> you will not get one single REQ_FLUSH before dismounting an ext2 or
> (on Ubuntu and some other distros) ext3 filing system with default options.
> With an older kernel or client you will never get a REQ_FLUSH *ever*. So
> if you throw away data because it is not flushed when you get an
> NBD_CMD_DISC you *will corrupt the filing system*. Do not do this. You
> should treat NBD_CMD_DISC as containing an implicit flush (by which I
> mean buffer flush, not necessarily write to disk), which it always
> has done (it closes the files).

I'm not talking about throwing away any data. The data will be written
or the write requests wouldn't have been ACKed.

This is only about whether to do an implicit FLUSH (meaning fsync())
on NBD_CMD_DISC or not. I think it is a good idea, but it should not
be required.
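
Something like this sketch is all I have in mind (export_fd and the
socket handling are placeholders):

#include <unistd.h>

/* NBD_CMD_DISC has no reply, so simply make sure everything that was
   ACKed is durable before closing the export. */
void handle_disconnect(int export_fd, int client_sock)
{
    fsync(export_fd);          /* the implicit flush under discussion */
    close(export_fd);
    close(client_sock);
}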

>>>> * NBD_CMD_FLUSH: Wait for all pending requests to finish, flush data
>>>>   (sync_file_range() on the whole file, fdatasync() or fsync()
>>>>   returned).
>>>
>>> You only need to wait until any writes that you have sent replies
>>> for have been flushed to disk. It may be easier to write more than
>>> that (which is fine). Whilst you do not *have* to flush any commands
>>> issued (but not completed) before the REQ_FLUSH, Jan Kara says "*please*
>>> don't do that".
>>
>> Urgs. Assuming one doesn't flush commands that were issued but not yet
>> completed. How then should the client force those to disk? Sleep until
>> they happen to be ACKed and only then flush?
>
> Easy. The client issues a REQ_FLUSH *after* anything that needs to
> be flushed to disk have been ACK'd.

This would make ACKing writes only when they reach the physical medium
(using libaio, not fsync() after every write) a total no-go with a
file-backed device. Is that really how Linux currently works? If so,
then I really need to switch to ACKing requests as soon as the write
is issued rather than when it completes.
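
For context, this is roughly the pattern I use now (a sketch;
send_ack() and the pending_write bookkeeping are hypothetical, link
with -laio):

#include <libaio.h>
#include <stdint.h>
#include <sys/types.h>

struct pending_write {
    struct iocb cb;            /* must stay valid until completion */
    uint64_t nbd_handle;       /* handle to echo back in the NBD reply */
};

void send_ack(uint64_t handle, int error);   /* hypothetical */

static io_context_t ctx;

void aio_init(void)
{
    io_setup(64, &ctx);        /* up to 64 requests in flight */
}

/* Queue the write; the reply is only sent from the completion loop. */
void submit_write(int fd, void *buf, size_t len, off_t off,
                  struct pending_write *pw)
{
    struct iocb *cbs[1] = { &pw->cb };

    io_prep_pwrite(&pw->cb, fd, buf, len, off);
    pw->cb.data = pw;          /* recover our state on completion */
    io_submit(ctx, 1, cbs);
}

/* Reap completions and ACK only now, i.e. when the write finished. */
void reap_and_ack(void)
{
    struct io_event ev[16];
    int n = io_getevents(ctx, 1, 16, ev, NULL);

    for (int i = 0; i < n; i++) {
        struct pending_write *pw = ev[i].data;
        long res = (long)ev[i].res;
        send_ack(pw->nbd_handle, res < 0 ? (int)-res : 0);
    }
}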

>> I agree with Jan Kara there. Please let's not do that. Let's document
>> (and implement) that a FLUSH command on nbd means all prior requests
>> are to be flushed and not just those already ACKed.
>
> I think the defined operational semantics should probably be those of
> the linux block layer, rather than us try and come up with something
> different. IE I don't think we should attempt to define the nbd
> protocol to be either more or less strict. We should say "it works
> the same as linux request base block driver semantics work - if you
> want to play fast and loose at the edges of permissibility, this
> is at your own risk".

Then we need to spell out exactly what that behaviour is:

a) A FLUSH affects at least all completed requests; a client must wait
   for request completion before sending a FLUSH.

b) A FLUSH might affect other requests. (Normally those issued but not
   yet completed before the flush is issued.)

c) Requests should be ACKed as soon as possible to minimize the delay
   until a client can safely issue a FLUSH.

Does that reflect the Linux block layer's behaviour?
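
Spelled out from the client's point of view, rule a) amounts to
something like this sketch (wait_for_reply() and send_flush() are
hypothetical helpers):

#include <stdint.h>

void     wait_for_reply(uint64_t handle);   /* hypothetical */
uint64_t send_flush(void);                  /* hypothetical: NBD_CMD_FLUSH */

/* Make a set of writes durable: wait for their replies first, because
   a FLUSH is only guaranteed to cover writes that have completed. */
void flush_writes(const uint64_t *write_handles, int nr)
{
    for (int i = 0; i < nr; i++)
        wait_for_reply(write_handles[i]);

    wait_for_reply(send_flush());
}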

>>>>   I assume any delayed write errors should be returned here.
>>>
>>> No. There is really no such thing as a "delayed write error". Consider
>>> what happens with the current server. We write to the block cache.
>>> That hits physical disk some time later. If the physical disk errors,
>>> the normal fs response is for the disk to go read-only. You are right
>>> that flush will then (probably) error, but so will every other write
>>> command. Basically, you are stuffed if you get an error.
>>
>> NBD is a block device. So you have to look at what a local block device
>> will do. From userspace I get the following:
>>
>> write() -> data goes into cache, returns OK
>> fsync() -> data goes to the physical medium, returns EIO on error
>
> But the write *may* EIO too. Also you don't have to do the fsync.

Sure. Or any number of other errors. If the write itself fails then
that can be returned directly. Assume that part works but syncing to
disk later fails.

>> Sadly the chapter on the Block IO Layer in "Linux Kernel Development"
>> is rather sparse with only 15 pages and half of them describing I/O
>> schedulers. So please excuse my ignorance.
>
> It all changed in 2.6.3x anyway.
>
>> How do filesystems even get the block number of a physical disk error?
>> My understanding is that a bio write goes into the disks cache and
>> completes. And later a REQ_FLUSH is sent to the disk to force the data
>> onto the platter. Doesn't the disk only report an error on that
>> REQ_FLUSH when it can't write a block?
>
> A request driven driver simply errors the request, in which case it
> is passed up and errors the relevant bios (there may be more than one).
> The errored block number is largely irrelevant as there might not
> be one (REQ_FLUSH) or it might be outside the bio due to merging
> (that's my understanding anyway).

How? I mean that just pushes the issue down a layer. The physical disk
gets a write request, dumps the data into its cache and ACKs the
write. The driver passes up the ACK to the bio and the bio
completes. Then some time later the driver gets a REQ_FLUSH and the disk
returns a write error when it finds out it can't actually write the
block.

Color me ignorant, but isn't that roughly how it will go with a disk
with write caching enabled?

Regards,
        Goswin


