
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Alex Bligh <alex@...872...> writes:

> Goswin,
>
> --On 29 May 2011 17:14:51 +0200 Goswin von Brederlow
> <goswin-v-b@...186...> wrote:
>> If only there was a write_fua() system call.
>>
>> Would it make sense (and result in correct behaviour) to open the
>> physical disk once normally and once with O_DATASYNC and do any write
>> with FUA flag over the O_SYNC fd? Would that perform better than doing
>> write()+fsync() on FUA?
>
> Yes absolutely. I asked Christoph H about this and he suggested this
> is the best way to do it. When I removed the sync_file_range() call
> I thought I left a comment saying this is what we should do. I just
> didn't have time to do it. As Christoph points out, right now, fsync()
> is not really an overhead because every FUA is next to a FLUSH (just
> because that's how filesystems migrated from the previous barrier system)
> however he plans that this may not be the case in XFS in future, and
> presumably other filesystems may diverge similarly.

I'm not sure that will ever change much. The problem is that the
filesystem has to make sure a number of blocks have been committed to
physical storage, and the only options are sending the requests with FUA
in the first place, which probably destroys the drive's write-cache
performance completely, or flushing the drive cache completely.

Isn't there a way to get the drive to tell you when it has actually
committed the data to physical storage, and to flush specific
requests only?
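
To make the two-fd idea concrete, here is roughly what I have in mind
(untested sketch, error handling mostly omitted, names made up; the
actual open(2) flag is spelled O_DSYNC):

/* Untested sketch of the "two file descriptors" idea: one plain fd for
 * ordinary writes, one O_DSYNC fd so that a write issued through it
 * reaches stable storage before write() returns, i.e. behaves like FUA. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

struct export_fds {
        int fd_plain;   /* normal writes */
        int fd_dsync;   /* FUA writes */
};

static int open_export(struct export_fds *e, const char *path)
{
        e->fd_plain = open(path, O_RDWR);
        if (e->fd_plain < 0)
                return -1;
        e->fd_dsync = open(path, O_RDWR | O_DSYNC);
        if (e->fd_dsync < 0) {
                close(e->fd_plain);
                return -1;
        }
        return 0;
}

static ssize_t export_write(struct export_fds *e, const void *buf,
                            size_t len, off_t off, int fua)
{
        /* A FUA write goes through the O_DSYNC descriptor; a normal
         * write goes through the plain one and is only guaranteed to
         * be stable after a later FLUSH (fsync/fdatasync). */
        return pwrite(fua ? e->fd_dsync : e->fd_plain, buf, len, off);
}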

> I can't speak for Wouter but I'd welcome a patch that did this right.
> Note you need to deal with /all/ the files that might be written to.
>
>>> No, CoW based filing systems *do* support FUA in that they send
>>> them out. Go trace what (e.g.) btrfs does.
>>
>> I'm saying an NBD server on a CoW based filing system can't properly do
>> FUA. Since sync_file_range() will not do the right thing there at all
>> there is no point in claiming to support FUA.
>
> But sync_file_range is not currently used. It *always* does an fsync().
>
> See the "#if 0" at line 1159 of nbd-server.c.
>
>> Let the client send FLUSH
>> requests instead. Same effect in the end.
>
> You can't control what the client sends (save for disabling stuff

The server tells the client whether it supports FUA. If it doesn't and
the client sends one anyway, that is a protocol violation and should
probably abort the connection.
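
Something like this on the request path, assuming the NBD_FLAG_SEND_FUA
and NBD_CMD_FLAG_FUA names and bit values from nbd.h (sketch only,
double-check the header for the exact encoding):

#include <stdint.h>

/* Assumed values: NBD_FLAG_SEND_FUA is a transmission flag offered
 * during negotiation; NBD_CMD_FLAG_FUA is ORed into the request type
 * in the old-style encoding. Verify against nbd.h. */
#define NBD_FLAG_SEND_FUA   (1 << 3)
#define NBD_CMD_FLAG_FUA    (1 << 16)

static int check_fua_allowed(uint32_t negotiated_flags, uint32_t req_type)
{
        if ((req_type & NBD_CMD_FLAG_FUA) &&
            !(negotiated_flags & NBD_FLAG_SEND_FUA)) {
                /* Protocol violation: the client used FUA without the
                 * server having offered it; drop the connection. */
                return -1;
        }
        return 0;
}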

> you don't want to receive). The block layer will ask for a FUA call,
> and (if, e.g., Christoph does what he plans) you may well get a FUA
> request at the block layer with no nearby flush. So the best thing
> is to "do more than is required" (i.e. the fsync()) until one of us
> has the time to do O_DATASYNC properly. However, what I wanted to do
> was to get FUA into the protocol, because nbd-server is not the
> only possible server out there.

My understanding is that a FUA request from the upper layers gets turned
into a FLUSH automatically when the driver doesn't support FUA. So if
the nbd-client doesn't enable FUA for the kernel, then any FUA request
from a filesystem should result in a FLUSH over the socket. Right?
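
And on the server side the conservative equivalent, which is roughly
what nbd-server does today anyway, is just write plus a full sync
(sketch, not the actual nbd-server code):

/* Sketch of the conservative FUA fallback: if we cannot do a real FUA
 * write (no O_DSYNC fd available), a write followed by fdatasync()
 * gives at least as strong a guarantee, just with more overhead. */
#include <unistd.h>
#include <sys/types.h>

static int handle_fua_write(int fd, const void *buf, size_t len, off_t off)
{
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                return -1;
        /* "Do more than is required": sync the whole file, which covers
         * this request and anything written before it. */
        return fdatasync(fd);
}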

>>>> c) Requests should be ACKed as soon as possible to minimize the delay
>>>>    until a client can savely issue a FLUSH.
>>>
>>> That's probably true performance wise as a general point, but there is
>>> a complexity / safety / memory use tradeoff. If you ACK every request
>>> as soon as it comes in, you will use a lot of memory.
>>
>> How do you figure that? For me a write request (all others can be freed
>> once they send their reply) allways uses the same amount of memory from
>> the time it gets read from the socket till the time it is written to
>> disk (cache). The memory needed doesn't change wether you ACK it once it
>> is read from the socket, when the write is issued or when the write
>> returned.
>
> If you ACK a write request before you've written it somewhere, you
> need to keep it in memory so you can write it later. Imagine you
> just get a continuous stream of writes. If you do what you say
> (i.e. ACK them as soon as possible, i.e. before even writing them)
> the client will send them to you faster than you can deal with them,
> and you will end up eating lots of memory up buffering them. Essentially
> you are doing a write-behind cache. You will want to give up buffering
> them at some point and start blocking, because the advantage of having
> (e.g.) 1000 writes buffered over 100 is minimal (probably).

That should make no difference to the client. If the kernel has 1000
dirty pages it can legally send 1000 write requests to the nbd-server
without waiting for a single ACK. As long as the filesystem (or whatever
uses the nbd device) doesn't run into a barrier and need to drain its
queue (e.g. for fsync()), there is no limit on the number of in-flight
requests the kernel could have in parallel. Obviously, in practice there
will be some limits on the client side regarding the amount of in-flight
requests, and filesystems usually hit a flush/fua all too quickly. The
maximum amount of in-flight data can probably be seen with a simple dd.

I agree that the server should have some limit on how much in-flight
data it will allow before it pauses reading more requests. There should
probably be a config option for this limit to prevent a client from
causing an OOM situation, say a default of 100MB. I don't think
filesystems, or other normal use, will hit that limit though.
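
A minimal sketch of such a limit, with made-up names and the 100MB
default from above:

/* Sketch: stop reading new requests from the socket once too much
 * un-acked write data is in flight, and resume once writes complete.
 * All names here are made up for illustration. */
#include <stddef.h>

#define DEFAULT_MAX_INFLIGHT (100UL * 1024 * 1024)   /* 100MB */

struct inflight_limit {
        size_t max_bytes;       /* config option */
        size_t cur_bytes;       /* currently buffered write payload */
};

/* Called before reading the next request from the socket. */
static int may_read_more(const struct inflight_limit *l)
{
        return l->cur_bytes < l->max_bytes;
}

/* Called when a write request has been read in... */
static void inflight_add(struct inflight_limit *l, size_t len)
{
        l->cur_bytes += len;
}

/* ...and when its write has completed and the reply has been sent. */
static void inflight_done(struct inflight_limit *l, size_t len)
{
        l->cur_bytes -= len;
}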

Regards,
        Goswin


