
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Alex Bligh <alex@...872...> writes:

> Goswin,
>
> --On 29 May 2011 14:53:01 +0200 Goswin von Brederlow
> <goswin-v-b@...186...> wrote:
>
>> That really sucks documentation wise. Because then you have to start a
>> hunt for further documentation which probably doesn't even exist other
>> than the source.
>>
>> It is ok to say we implement what the linux block layer expects but it
>> should be spelled out in the text or at least name a file to look at for
>> the details. It should not be left this vague.
>
> Point taken. However, it does mean we would end up documenting
> in practice how the linux block layer behaves. Which itself is
> not static.

Which makes it even more important. We are implementing how the block
layer behaves now, not how it behaves next year. So if things break
next year, you need some document stating what behaviour was
implemented, as opposed to whatever Linux happens to do then. Otherwise
you can't tell whether the problem is a bug in the implementation or an
incompatibility with a newer Linux.

If the two become incompatible, then a new protocol version needs to be
defined.

>>>> True. And a read reply takes times (lots of data to send). In case there
>>>> are multiple replies pending it would make sense to order them so that
>>>> FUA/FLUSH get priority I think. After that I think all read replies
>>>> should go out in order of their request (oldest first) and write replies
>>>> last. Reason being that something will be waiting for the read while the
>>>> writes are likely cached. On the other hand write replies are tiny and
>>>> sending them first gets them out of the way and clears up dirty pages on
>>>> the client side faster. That might be beneficial too.
>>>>
>>>> What do you think?
>>>
>>> There's no need to specify that in the protocol. It may be how you choose
>>> to implement it in your server; but it might not be how I choose to
>>> implement it in mine. A good example is a RAID0/JBOD server where you
>>> might choose to split the incoming request queue by underlying physical
>>> device (and split requests spanning multiple devices into multiple
>>> requests). Each device's queue could be handled by a separate thread.
>>> This is perfectly permissible, and there needs to be no ordering between
>>> the queues.
>>
>> Obviously. That was purely an implementation question.
>>
>> I think you also misunderstood me. I didn't mean that incoming requests
>> should be ordered in this way but that pending outgoing replies should
>> be.
>
> I don't think I misunderstood. What I meant was that in such a JBOD
> situation, one disk X might be replying more slowly than disk Y
> (say because it happens to have a more seeky load). So a read
> request to disk Y issued after a read request to disk X might
> result in a disk Y reply coming before the reply for disk X.

Exactly. The read reply for disk Y would be ready to send before the
one for disk X. My case only arises when both disk X and disk Y are
ready to reply: which one gets to send its reply first?
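
Concretely, the ordering I had in mind would boil down to something
like this comparator over the replies that are ready to be sent (sketch
only; the struct and the priority values are invented, not taken from
nbd-server):

    #include <stdint.h>

    /* FLUSH/FUA acks first, then read replies oldest-first, write acks
     * last.  Intended for qsort() over the queue of ready replies. */
    enum reply_class { REPLY_FLUSH_FUA = 0, REPLY_READ = 1, REPLY_WRITE = 2 };

    struct pending_reply {
        enum reply_class class;
        uint64_t seq;   /* order in which the request was received */
    };

    static int reply_cmp(const void *a, const void *b)
    {
        const struct pending_reply *x = a, *y = b;
        if (x->class != y->class)
            return (int)x->class - (int)y->class;
        return (x->seq > y->seq) - (x->seq < y->seq);
    }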

>> But I just thought about something else you wrote that makes this a bad
>> idea. You said that a FLUSH only ensures completed requests have been
>> flushed.
>
> yes
>
>> So if a FLUSH ACKed before a WRITE then the client should
>> assume that WRITE wasn't yet flushed and issue another FLUSH.
>
> If a client wants to flush a particular write, it should just not
> issue the flush until the write is ACK'd. No need to issue two
> flushes. This is the way the request system works, not my design!
>
>> To prevent
>> this a FLUSH ACK should be after any WRITE ACK that it flushed to
>> disk.
>
> No, it need only come after it has flushed any acknowledged writes.
> IE it can send the ACK before the ACK of any other writes (for
> instance unacknowledged ones issued before the flush) it happened to flush
> to disk at the same time.
>
>> So there should be some limits on how much a FUA/FLUSH ACK can
>> skip other replies.
>
> I don't see why it has to be any different to the current linux
> request model. It's a bit odd in some ways, but it is a working system.

Assume it all comes from my ignorance of the Linux request model. :)
But see the other mails; this has been cleared up by now.

[...]
> FUA is normally handled simply by keeping the bit in the request
> and ensuring it writes through.

If only there were a write_fua() system call.

Would it make sense (and result in correct behaviour) to open the
physical disk once normally and once with O_DATASYNC, and send any
write carrying the FUA flag over that second fd? Would that perform
better than doing write()+fsync() for FUA?
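
To make that concrete, the two variants I am comparing would look
roughly like this (untested sketch; I use O_DSYNC since that is the
flag name Linux actually has, and the function names are invented):

    #include <fcntl.h>
    #include <unistd.h>

    /* Variant 1: single fd, emulate FUA with a flush afterwards.
     * fdatasync() flushes more than just this one write. */
    static int write_fua_fsync(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        return fdatasync(fd);
    }

    /* Variant 2: a second fd on the same file, opened with O_DSYNC.
     * If O_DSYNC really behaves like REQ_FUA, a write on this fd only
     * completes once the data is on stable storage. */
    static int write_fua_dsync(int dsync_fd, const void *buf, size_t len, off_t off)
    {
        return pwrite(dsync_fd, buf, len, off) == (ssize_t)len ? 0 : -1;
    }

    /* The second fd would be set up once, next to the normal one:
     *     int fd       = open(path, O_RDWR);
     *     int dsync_fd = open(path, O_RDWR | O_DSYNC);
     */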

>>> Yes. But it's more than that. If you write to a CoW based filing system,
>>> fsync (and even fdatasync) will ensure the CoW metadata is also flushed
>>> to the device, whereas sync_file_range won't. Without the CoW metadata
>>> being written, the data itself is not really written.
>>
>> Which just means that a CoW based filing system or sparse files don't
>> support FUA.
>
> No, CoW based filing systems *do* support FUA in that they send
> them out. Go trace what (e.g.) btrfs does.

I'm saying that an NBD server sitting on a CoW based filing system
can't do FUA properly. Since sync_file_range() will not do the right
thing there at all, there is no point in claiming to support FUA. Let
the client send FLUSH requests instead; same effect in the end.

> I think you are confusing block layer semantics (REQ_FLUSH and
> REQ_FUA). There is no VFS equivalent of either REQ_FLUSH or
> REQ_FUA. fsync() on a file does roughly what REQ_FLUSH does.
> Opening a second file with O_DATASYNC set and writing the blocks
> to that does roughly what REQ_FUA does (an fdatasync() does
> rather more). nbd-server currently does "more than it needs"
> for REQ_FUA, but given that almost all REQ_FUA are immediately
> followed by a REQ_FLUSH, and 2 x fdatasync in a row are no
> more work than one, this doesn't matter.

I think that answers my question from above about opening the disk
twice.

>> The idea of a FUA is that it is cheaper than a FLUSH. But
>> if nbd-server does fsync() in both cases then it is pointless to
>> announce FUA support.
>
> Well, FUA could (and will if I have a minute) be implemented using
> a shadow file and O_DATASYNC. I think there is a comment to that
> effect.

So let's only advertise that we can do FUA when we actually do it
better than FLUSH, i.e. when there is a shadow file or some other real
implementation for it. Otherwise be truthful and say we only do FLUSH.

I think that should also be recommended in the protocol spec. Clients
might behave differently when they know they only have FLUSH.
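
In code that would be tiny, something like this (sketch; the flag bits
are the ones I see in nbd.h but double check them, and the
have_real_fua parameter is a made-up stand-in for "we have an
O_DSYNC/shadow-file path"):

    #include <stdint.h>

    #define NBD_FLAG_HAS_FLAGS   (1 << 0)
    #define NBD_FLAG_SEND_FLUSH  (1 << 2)
    #define NBD_FLAG_SEND_FUA    (1 << 3)

    /* Only advertise FUA when the server really has a cheaper path for
     * it than a full flush; otherwise advertise FLUSH alone. */
    static uint32_t export_flags(int have_real_fua)
    {
        uint32_t flags = NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_FLUSH;
        if (have_real_fua)
            flags |= NBD_FLAG_SEND_FUA;
        return flags;
    }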

>> Say you are running "mkfs -t ext4 -c -c /dev/nbd0". Now you hit one bad
>> block and the device turns itself into read-only mode. Not the behaviour
>> you want.
>
> Well, assuming you want the mkfs to carry on, and it wants to know
> where block errors are, should be
> opening the disk with O_SYNC (or O_DATASYNC these days) which will
> translate into REQ_FUA as I understand it.

That was one of my earlier questions. :) Or close to one. I haven't
patched my kernel yet to do FLUSH/FUA on NBD, but maybe you could
verify that this is actually the behaviour:

1) open with O_SYNC/O_DATASYNC makes all write requests have FUA set
2) 'mount -o sync /dev/nbd0 /mnt/tmp; echo >/mnt/tmp/foo' makes all
   write requests have FUA set
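
For 1) a minimal client-side test would be something like the program
below, run against a scratch export while watching which requests the
server receives (untested sketch; it scribbles over the first 4k of
/dev/nbd0, so test data only):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* If 1) holds, this write should arrive at the server as an
         * NBD write request with the FUA flag set. */
        int fd = open("/dev/nbd0", O_WRONLY | O_DSYNC);
        if (fd < 0)
            return 1;

        char buf[4096];
        memset(buf, 0xaa, sizeof(buf));
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
            return 1;

        close(fd);
        return 0;
    }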

> Consider a normal SATA disk with a write-behind cache (forget nbd
> for a minute). From memory mkfs just does normal block writes. It
> may do an fsync() on the block device which results in a sync
> at the end. It has no way of knowing where bad blocks are anyway.
> (In practice SATA devices do their own bad block management but
> you probably know that).

Indeed. It needs O_SYNC to catch write errors on the badblock pass.

>>>> What happens if the client mounts a filesystem with sync option? Does
>>>> nbd then get every request with FUA set?
>>>
>>> No. The client sends FUA only when there is FUA on the request. The sync
>>> option is a server option, and the server merely syncs after every
>>> request.
>>
>> It is also a mount option. Mounting a filesystem with sync in a local
>> disk and nbd should give the same behaviour.
>
> A mount option is something different. That will (as I understand it)
> cause the block layer to work synchronously, and you will get
> FUA/FLUSH/whatever. No client negotiation (beyond advertising
> support of these) is needed. That's just how a sync mount option
> would work with any other block device.
>
> Note nbd devices don't have to be mounted. What you are proposing
> would affect raw I/O (not mounted I/O) to these block devices.

We are starting to mix up different threads of thought. :)

In a perfect world I want to have 2 separate options:

1) mount -o sync works as securely on NBD as on local disks
2) a client option to negotiate SYNC behaviour with the server,
   independently of how /dev/nbdX is used (optionally only when the
   server doesn't support FLUSH)

>> Then we need to spell out what that behaviour exactly is:
>>
>> a) A FLUSH affects at least all completed requests, a client must wait
>>    for request completion before sending a FLUSH.
>
> Yes. Except the client only need wait for completion of those requests
> it wants to ensure are flushed (not every request).

Yes. This should easily happen with multiple LVs on an NBD. A flush of
one LV only has to drain the queue for that LV and can then flush
irrespective of what the other LVs do. Are filesystems (ext4? btrfs?)
smart enough to track which requests they need to wait for before a
flush and which they don't? I guess that makes a huge difference for
fsync() on one file while another is being written to.
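
The kind of bookkeeping I mean, on the side that issues the flush,
would be roughly this (pure sketch; the struct and the two helpers are
invented):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for whatever the filesystem or
     * client actually does. */
    void wait_for_ack(uint64_t handle);
    void send_flush(void);

    struct inflight_write {
        uint64_t handle;            /* NBD request handle */
        int      belongs_to_target; /* part of the file/LV being synced? */
    };

    /* Wait only for the writes this flush actually cares about, then
     * send a single flush request. */
    static void flush_one_target(struct inflight_write *w, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (w[i].belongs_to_target)
                wait_for_ack(w[i].handle);
        send_flush();
    }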

>> b) A FLUSH might affect other requests. (Normaly those issued but not
>>    yet completed before the flush is issued.)
>
> Yes. You can always flush more than is required.
>
>> c) Requests should be ACKed as soon as possible to minimize the delay
>>    until a client can savely issue a FLUSH.
>
> That's probably true performance wise as a general point, but there is
> a complexity / safety / memory use tradeoff. If you ACK every request
> as soon as it comes in, you will use a lot of memory.

How do you figure that? For me a write request (all others can be
freed once their reply is sent) always uses the same amount of memory
from the time it is read from the socket until the time it is written
to disk (cache). The memory needed doesn't change whether you ACK it
once it is read from the socket, when the write is issued, or when the
write returns.

Regards
        Goswin


