
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Goswin,

Wouter: Could we make a decision here about the behaviour of a correct
nbd-server in this? Must it logically preserve the order of read/write
requests (i.e. return the value you would get if they had been done in
order) or can it implement the disordered behaviour that linux seems to
allow?

The latter would be much simpler code-wise.

I think we should probably involve the block layer people in this as
whilst the ability to disorder a read to go before a write appears
to be technically allowed, I am under the impression it hasn't
received a lot of testing. Perhaps the answer is that we simply
don't specify it in the protocol, but say "you must do what the
linux block layer expects".

True. And a read reply takes time (lots of data to send). When there
are multiple replies pending it would make sense to order them so that
FUA/FLUSH replies get priority, I think. After that I think all read replies
should go out in order of their request (oldest first) and write replies
last, the reasoning being that something will be waiting for the read while
the writes are likely cached. On the other hand, write replies are tiny and
sending them first gets them out of the way and clears up dirty pages on
the client side faster. That might be beneficial too.

What do you think?

There's no need to specify that in the protocol. It may be how you choose
to implement it in your server; but it might not be how I choose to
implement it in mine. A good example is a RAID0/JBOD server where you
might choose to split the incoming request queue by underlying physical
device (and split requests spanning multiple devices into multiple
requests). Each device's queue could be handled by a separate thread.
This is perfectly permissible, and there needs to be no ordering between
the queues.
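
To make that concrete, here is a sketch only (hypothetical structures and
names, not the actual nbd-server code) of how such a per-device split might
look for a RAID0-style export. Requests spanning a stripe boundary are
assumed to have been split beforehand, and one worker thread would drain
each queue, with no ordering maintained across queues:

#include <stdint.h>

#define STRIPE_SIZE  (64 * 1024)   /* assumed stripe size */
#define NUM_DEVS     4             /* assumed number of underlying devices */

struct request {
    uint64_t offset;               /* offset within the export */
    uint32_t len;
    struct request *next;
};

struct dev_queue {
    struct request *head, *tail;   /* in real, threaded code a per-queue
                                      mutex/condvar would guard this */
};

static struct dev_queue queues[NUM_DEVS];

/* Route a request to the queue of the device that backs its offset.
 * Ordering is only preserved within a single queue. */
static void dispatch(struct request *req)
{
    unsigned dev = (req->offset / STRIPE_SIZE) % NUM_DEVS;
    struct dev_queue *q = &queues[dev];

    req->next = NULL;
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
}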

I don't get the bit about

   *  a) have a volatile write cache in your disk (e.g. any normal SATA
disk)

Isn't that a major bug in sync_file_range then?

It is one of those cases of a syscall that does not do what you think
it does. All it does is cause writeout to be initiated (and possibly
wait until the writeout has completed). It does not ensure the writeout
hits the underlying device.

How does f(data)sync() ensure data is on the physical disk, while
sync_file_range() returns when the data has only been transmitted to the
disk?

Is that because f(data)sync() will cause a FLUSH+FUA pair on the
underlying FS (or FLUSH on a device)?

Yes. But it's more than that. If you write to a CoW based filing system,
fsync (and even fdatasync) will ensure the CoW metadata is also flushed
to the device, whereas sync_file_range won't. Without the CoW metadata
being written, the data itself is not really written.
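
A minimal sketch of the difference, assuming an already-open fd on the
backing file (illustration only): sync_file_range() merely pushes dirty
pages towards the device, whereas fdatasync() also causes the device cache
flush and, on a CoW filesystem, commits the metadata needed to reach the
data.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Starts writeback of the whole file and waits for it, but does NOT
 * flush the disk's volatile write cache or any filesystem metadata. */
static int start_writeback(int fd)
{
    return sync_file_range(fd, 0, 0,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Ensures (as far as the kernel can) that the data is on stable storage:
 * causes a cache flush on the device and commits the metadata needed to
 * find the data. */
static int make_durable(int fd)
{
    return fdatasync(fd);
}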

You don't. To be safe, I'd error every write (i.e. turn the medium
read only).

Why not return EIO on the next FLUSH? If I return success on the next
FLUSH that would make the client think the write has successfully
migrated to the physical medium. Which would not be true.

Because
a) there may not be a next FLUSH at all
b) the next FLUSH might not come for a long time, and you really don't
  want the FS to continue writing to a disk that has an error on it.
c) the error checking on FLUSH may lose the error.

To be clear, I wasn't saying "don't error the next flush". I am
saying you should error ALL write-based commands including the next
flush and every other flush.

And turning the device read-only seems like a bad idea. How would
badblocks be able to work then?

The FS going readonly is exactly what happens if you have an error
on a SCSI disk. I presume you then detach the drive, reattach and
run badblocks.

I suppose if one error is enough to make the FS go read-only you
might be safe just erroring one flush.
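
To illustrate the "error every write-based command" approach (a sketch
only, with hypothetical helper names, not how nbd-server is actually
structured): once any write or flush fails, latch a flag and error
everything from then on.

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

static bool export_failed;      /* latched on the first write/flush error */

/* Returns 0 on success, -EIO once the export has gone bad. */
static int handle_write(int fd, const void *buf, size_t len, off_t off)
{
    if (export_failed)
        return -EIO;                        /* error ALL later writes */
    if (pwrite(fd, buf, len, off) != (ssize_t)len) {
        export_failed = true;
        return -EIO;
    }
    return 0;
}

static int handle_flush(int fd)
{
    if (export_failed || fdatasync(fd) != 0) {
        export_failed = true;               /* ...and every later flush */
        return -EIO;
    }
    return 0;
}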

    The current behaviour is b I think

Correct, but from the client's point of view this is indistinguishable
from (a).

It makes a difference if the nbd-server process dies but the system
itself doesn't crash. With (b) the data will still end up on disk. Slight
difference. But I guess in both cases the power can fail and lose the
data, and clients have to assume that might happen.

Right, but you might have (for instance) a write-log, so the next
start of nbd-server cleans things up. The assumption should be that
if FLUSH or FUA is complete, the relevant data is in non-volatile
storage (somehow), i.e. restarting after a power cut will result in
the data being there.
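
For illustration, such a write-log might work like this (a sketch only,
with a hypothetical record format; nbd-server does not currently do this):
each write is appended to a log file and the log is fdatasync()ed before
the FUA/FLUSH reply goes out, so a restart can replay any records that had
not yet reached the backing file.

#include <stdint.h>
#include <unistd.h>

/* Hypothetical write-log record: header followed by the data itself. */
struct log_rec {
    uint64_t offset;
    uint32_t len;
};

static int log_write(int log_fd, uint64_t offset,
                     const void *data, uint32_t len)
{
    struct log_rec rec = { .offset = offset, .len = len };

    if (write(log_fd, &rec, sizeof rec) != (ssize_t)sizeof rec)
        return -1;
    if (write(log_fd, data, len) != (ssize_t)len)
        return -1;
    return fdatasync(log_fd);   /* record is durable before we reply */
}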

There's not much point the server asking the client to set a bit on
every request, because it can set it itself; it's /practically/ identical
to FUA on everything (I think there may be a difference in metadata,
i.e. syncing mtime).

What happens if the client mounts a filesystem with sync option? Does
nbd then get every request with FUA set?

No. The client sends FUA only when there is FUA on the request. The sync
option is a server option, and the server merely syncs after every
request.

It might make sense to implement an NBD_FLAG_SYNC option that the
nbd-client can set in the handshake to force an export into sync mode.

Oh I see, so it can force everything to be synchronous, whether the
server (or the kernel) wants it to or not.

  I'm considering having an --auto-sync mode that runs in sync mode
  when NBD_FLAG_SEND_FUA and NBD_FLAG_SEND_FLUSH are not used. This would
  be slow for old clients and fast for new ones but ensure data
  integrity for both. Maybe something nbd-server could use too.

Note that even if you set NBD_FLAG_SEND_{FUA,FLUSH} you won't actually
get FUA / FLUSH unless:

a) the client supports it (because the client has to do a SETFLAGS
  ioctl)

No. The --auto-sync mode would take care of exactly that. If the client
does not support FUA/FLUSH then sync would be activated.

If you don't issue the ioctl (which you won't unless the client
supports it), FLUSH and FUA won't be sent. This is to prevent the
kernel sending FLUSH and FUA to servers that don't support them.

The other way of doing this is to cause the server to do sync even
when the sync option is not set. However, you will need client support
for this too (so it can be negotiated), in which case not starting
from a base position of a client that supports FLUSH and FUA would
be perverse.

b) the kernel supports it

The client should check this before telling the server it will do
FUA/FLUSH.

You can check that with the ioctl (it will error if it doesn't
support SETFLAGS). However, I think you may have the negotiation
a bit backwards. Currently the server says "I support FLUSH and
FUA, send them to me if you like". The client does not need to
tell the server that it will or won't do FLUSH and FUA, and indeed
has no way of knowing. The best it can do is permit the kernel
to send FLUSH and FUA, which it does by calling the ioctl.
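
A hedged sketch of that from the client side, assuming the NBD_SET_FLAGS
ioctl and flag bits from the patched linux/nbd.h (an unpatched kernel will
simply error the ioctl and will then never generate FLUSH or FUA):

#include <linux/nbd.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* flags: the per-export flags the server sent during the handshake.
 * Returns 0 if the kernel accepted them and may now send FLUSH/FUA,
 * -1 if the kernel predates NBD_SET_FLAGS. */
static int permit_flush_fua(int nbd_fd, unsigned int flags)
{
    if (ioctl(nbd_fd, NBD_SET_FLAGS, (unsigned long)flags) < 0) {
        perror("NBD_SET_FLAGS");
        return -1;
    }
    return 0;
}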

c) the filing system supports it.

Because of (c), you won't get any on ext2, or (depending on distribution)
ext3, unless you mount with -o barrier=1 (some distros do this by
default). ext4 will send them by default, as do btrfs and xfs.

Nothing one can do there. If the FS doesn't operate safely then it won't
get safe behaviour from the nbd-server. Same as if it had a physical disk.
I just don't want to make it less safe.

Sure, but what I'm saying is you can't automagically detect lack of
FLUSH and FUA (which might be for any number of reasons) and turn
sync on.

Therefore, even if you know (a) and (b) are satisfied, you should
at least attempt to be safe if you never receive a single flush.

Hmm, maybe work in sync mode until the first FLUSH is received? That
might actually be better.

But I might not want to work in sync mode. I might be quite happy
with ext2-style behaviour. I think your option of modifying
the client to request sync mode is far easier, and far better
than fiddling with the requests as sent by the kernel. The main
advantage is it requires no kernel patches!

No, you should not rely on this happening. Even umount of an ext2 volume
will not send an NBD_FLUSH, even where kernel, client, and server support it.
You don't need to write it then and there (in fact there is no 'then
and there' as an NBD_CMD_DISC has no reply), but you cannot guarantee
*at all* that you will have received any sort of flush under any
circumstances.

I meant that if the client cares about the data being written before a
disconnect it has to explicitly flush. If it doesn't flush then that is
its problem.

And what I am saying is that this is not current behaviour. Even with
today's nbd release and my patched kernel (neither of which you are guaranteed)
you will not get one single REQ_FLUSH before dismounting an ext2 or
(on Ubuntu and some other distros) ext3 filing system with default options.
With an older kernel or client you will never get a REQ_FLUSH *ever*. So
if you throw away data because it is not flushed when you get an
NBD_CMD_DISC you *will corrupt the filing system*. Do not do this. You
should treat NBD_CMD_DISC as containing an implicit flush (by which I
mean buffer flush, not necessarily write to disk), which it always
has done (it closes the files).

* NBD_CMD_FLUSH: Wait for all pending requests to finish, then flush the
  data (i.e. sync_file_range() on the whole file, fdatasync(), or fsync()
  has returned).

You only need to wait until any writes that you have sent replies
for have been flushed to disk. It may be easier to flush more than
that (which is fine). Whilst you do not *have* to flush any commands
issued (but not completed) before the REQ_FLUSH, Jan Kara says "*please*
don't do that".

Urgs. Assuming one doesn't flush commands that were issued but not yet
completed, how then should the client force those to disk? Sleep until
they happen to be ACKed and only then flush?

Easy. The client issues a REQ_FLUSH *after* anything that needs to
be flushed to disk has been ACK'd.

I agree with Jan Kara there. Please let's not do that. Let's document (and
implement) that a FLUSH command on nbd means all prior requests are to
be flushed, not just those already ACKed.
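
Roughly what I have in mind on the server side is something like this
(a sketch only, with hypothetical helper names, not the current nbd-server
code): track writes in flight, and have the FLUSH handler wait for them
and then fdatasync() before replying. This flushes slightly more than the
bare minimum, which as you say is fine.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle = PTHREAD_COND_INITIALIZER;
static unsigned in_flight;      /* writes received but not yet completed */

/* Called when a write request is taken off the wire. */
static void write_started(void)
{
    pthread_mutex_lock(&lock);
    in_flight++;
    pthread_mutex_unlock(&lock);
}

/* Called when that write has been performed. */
static void write_finished(void)
{
    pthread_mutex_lock(&lock);
    if (--in_flight == 0)
        pthread_cond_broadcast(&idle);
    pthread_mutex_unlock(&lock);
}

/* NBD_CMD_FLUSH: wait for all prior writes, then push them to stable
 * storage before sending the reply. */
static int handle_flush(int fd)
{
    pthread_mutex_lock(&lock);
    while (in_flight)
        pthread_cond_wait(&idle, &lock);
    pthread_mutex_unlock(&lock);
    return fdatasync(fd);       /* 0 on success, -1 (e.g. EIO) on error */
}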

I think the defined operational semantics should probably be those of
the linux block layer, rather than us trying to come up with something
different. I.e. I don't think we should attempt to define the nbd
protocol to be either more or less strict. We should say "it works
the same as linux request-based block driver semantics - if you
want to play fast and loose at the edges of permissibility, that
is at your own risk".

  I assume any delayed write errors should be returned here.

No. There is really no such thing as a "delayed write error". Consider
what happens with the current server. We write to the block cache.
That hits physical disk some time later. If the physical disk errors,
the normal fs response is for the disk to go read-only. You are right
that flush will then (probably) error, but so will every other write
command. Basically, you are stuffed if you get an error.

NBD is a block device. So you have to look at what a local block device
will do. From userspace I get the following:

write() -> data goes into cache, returns OK
fsync() -> data goes to the physical medium, returns EIO on error
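
In code what I mean is roughly this (a sketch only):

#include <unistd.h>

/* write(): normally just lands in the page cache and returns OK.
 * fsync(): forces the data to the physical medium; this is where a
 * write-back error typically surfaces as EIO. */
static int write_durably(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;              /* the write itself may also fail */
    if (fsync(fd) != 0)
        return -1;              /* EIO here means the data may be lost */
    return 0;
}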

But the write *may* EIO too. Also you don't have to do the fsync.

Sadly the chapter on the Block IO Layer in "Linux Kernel Development"
is rather sparse, with only 15 pages and half of them describing I/O
schedulers. So please excuse my ignorance.

It all changed in 2.6.3x anyway.

How do filesystems even get the block number of a physical disk error?
My understanding is that a bio write goes into the disk's cache and
completes. And later a REQ_FLUSH is sent to the disk to force the data
onto the platter. Doesn't the disk only report an error on that
REQ_FLUSH when it can't write a block?

A request-driven driver simply errors the request, in which case the error
is passed up and errors the relevant bios (there may be more than one).
The errored block number is largely irrelevant as there might not
be one (REQ_FLUSH) or it might be outside the bio due to merging
(that's my understanding anyway).

What is the behaviour when doing an fsync() in one thread followed by a
write() in a second thread (while fsync() still runs)?

Say you have:

Thread 1                Thread 2
                        write 0 to block
                        fsync
write 1 to block
fsync
                        write 2 to block
fsync returns

Is it guaranteed that the block now contains 1 or 2, or could the last
write cancel the second as unnecessary but not sync the third? Could one
end up with the block still containing 0? I'm assuming it can't but let's
be sure.

I am not sure whether you are talking about writing to the same fd, or
to two different FDs mapped to the same file. I am not sure the
former is ever a good idea. The latter works, and I think the semantic
is that you can only be sure that data for which a write completed PRIOR
to an fsync starting is written. I'm not sure in your example where the
right-hand fsync returns.

However, note the semantics of REQ_FLUSH are not necessarily the
same as fsync().

--
Alex Bligh


