Re: [Nbd] Design concept for async/multithreaded nbd-server
- To: email@example.com
- Subject: Re: [Nbd] Design concept for async/multithreaded nbd-server
- From: Goswin von Brederlow <goswin-v-b@...186...>
- Date: Fri, 09 Mar 2012 09:21:45 +0100
- Message-id: <87ipieyyjq.fsf@...860...>
- In-reply-to: <62B93D3E62545AECF3FC70FC@...873...> (Alex Bligh's message of "Thu, 08 Mar 2012 21:05:53 +0000")
- References: <87pqcutvko.fsf@...860...> <62B93D3E62545AECF3FC70FC@...873...>
Alex Bligh <alex@...872...> writes:
> comments in line
> --On 3 March 2012 00:46:47 +0100 Goswin von Brederlow
> <goswin-v-b@...186...> wrote:
>> There are multiple levels of async behaviour possible with more or less
>> improvement in speed and increase in risk of data loss:
>> 1) Handle requests in parallel but wait for each request to complete
>> before replying. This would involve using fsync/sync_file_range/msync to
>> ensure data reaches the physical disk (the disk's write cache actually)
>> before replying. This would be perfectly safe.
>> 2) Handle requests in parallel but wait for each request to complete
>> before replying. But do not fsync/sync_file_range/msync unless required
>> by FUA or FLUSH. This would still be safe as long as the system does not
>> crash. An nbd-server crash would not result in data loss.
>> 3) Handle requests in parallel and reply immediately on receiving write
>> requests. This would be the fastest but also involve the most risk. The
>> nbd-server would basically cache writes for a short while and an
>> nbd-server crash would lose that data. Error detection would also be
>> problematic since requests have already been acknowledged by the time a
>> write error occurs. The error would have to be transmitted in reply to
>> the next FLUSH request or as a new kind of packet. So this might go to
> If the only client is linux (that's a big 'if'), or if the only specified
> level of synchronous behaviour of the client is 'as per linux kernel'
> (a rather smaller 'if'), then (3) is the way to go, as the linux block
> ordering semantic is very simple. In essence, if multiple requests are
> in flight, you can process and complete them in whatever order you want
Does that hold true with multiple clients using gfs or ocfs or similar?
Are the filesystems written in such a way to preserve that ordering
semantic with multiple clients?
> Your 'risks' in (3) do not exist with a linux client because the fs
> layer (of a flush/fua compliant fs) will not issue requests that would
> cause data loss without waiting for a reply to their flush/fua. Broken
> filing systems (e.g. ext2) are inherently unsafe anyway, as if you pull
> the power cord data may be in your HD cache. Errored writes are lost.
The problem there is that the client won't be able to resume operations
safely after a crash (persist option). The reconnect done when using
-persist is transparent to the filesystem, right?
To do a reconnect the client would have to resend all the requests since
the last FLUSH, including those the server has already replied to, and I
don't think the lower layers are prepared for that.
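
To make the resend requirement concrete, here is a minimal sketch of the
bookkeeping a client would need. Everything in it (ReplayLog and its method
names) is hypothetical and not part of any existing nbd client. It also
assumes mode 3, where every write is acknowledged on receipt, so "issued
before the FLUSH" and "replied to before the FLUSH" mean the same thing:

```python
class ReplayLog:
    """Hypothetical client-side log of writes not yet covered by a FLUSH."""

    def __init__(self):
        self._writes = []  # list of (sequence number, offset, data)
        self._seq = 0

    def record_write(self, offset, data):
        """Remember a write, even if the server has already replied to it."""
        self._seq += 1
        self._writes.append((self._seq, offset, data))
        return self._seq

    def flush_issued(self):
        """Return a marker covering every write recorded so far."""
        return self._seq

    def flush_acked(self, marker):
        """Drop the writes the acknowledged FLUSH is assumed to cover."""
        self._writes = [w for w in self._writes if w[0] > marker]

    def to_resend(self):
        """Writes that would have to be replayed after a reconnect."""
        return [(offset, data) for _, offset, data in self._writes]
```

The point is only that the client has to keep the data of already
acknowledged writes around until the next FLUSH completes, which is exactly
what I doubt the lower layers do today.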
> The point of the barrier design is to push the understanding of risk up
> to the FS layer, rather than handle it in the block device. IE if it
> needs to know something has 'hit the magnetic media' it uses FUA (etc.)
> and by not setting that, it expressly does not slow things down by
> ensuring stuff is written out with FUA that does not need to be written
> out FUA. Same with flush.
> I have done stats on this (admittedly with a rather different backend)
> and each of your proposals 1 and 2 is significantly slower than (3).
> However, I'd suggest you code whatever you are doing with a flag to
> implement this stuff (I did) so you can measure performance.
Actually, why would that be? The linux kernel keeps multiple requests
in flight and does not wait for each reply. Only a FUA/FLUSH will block.
The difference between 2 and 3 should only be how many requests are
in flight, not how long the FUA/FLUSH takes. So unless the client hits
some limit on the number of requests in flight or runs out of memory to
buffer requests (and their data) there should be no difference in speed.
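
A toy model of that argument, as a sketch only: the function name and all
parameters are made up for illustration, times are in arbitrary units, and
the model looks at nothing but the client's in-flight window. It ignores
server-side buffering limits and the cost of FUA/FLUSH itself.

```python
import heapq


def time_to_issue_all(n, window, send_gap, ack_delay):
    """Time at which the client has issued n requests.

    window    -- maximum number of unacknowledged requests (hypothetical limit)
    send_gap  -- time the client needs between issuing two requests
    ack_delay -- time until the server replies (about 0 in mode 3,
                 the write latency in mode 2)
    """
    acks = []  # min-heap of reply times for requests still in flight
    now = 0
    for _ in range(n):
        # Forget requests that have been replied to by now.
        while acks and acks[0] <= now:
            heapq.heappop(acks)
        # Window full: block until the oldest outstanding reply arrives.
        if len(acks) >= window:
            now = heapq.heappop(acks)
        heapq.heappush(acks, now + ack_delay)
        now += send_gap
    return now
```

With a window larger than the number of requests, the result is the same
for any ack_delay. Only when the window is small compared to
ack_delay / send_gap does the mode 2 case fall behind.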
> For completeness, there is an option (4): do everything in parallel
> and ignore FLUSH and FUA completely. This goes even faster, but
> is clearly unsafe.
That is so unsafe I wouldn't even consider testing that. No, if the server
enters a contract to honor FUA/FLUSH then it needs to do so.
For testing purposes though you can run nbd-server with an image in
tmpfs or using eatmydata to get that effect without having to
implement a thing in the server.
> For various work related reasons, I think I should stay out of the
> discussion on implementation. We are not doing the same thing (our
> backend is not a file), but I don't want to wander into potentially
> dangerous territory.
I'm implementing a standalone server that exports a single disk (or set
of disks) like nbd-server. But that is just the fallback option for my
actual project. Basically just so I have a fallback server with
identical behaviour and features to the real deal.
My real implementation is a distributed raid with n data disks and m
redundancy disks. Sort of like raid4 but able to cope with up to m disk
failures. The limit is n + m <= 65536 so this could be a really big
storage pool. The redundancy calculations need strict serialization,
more than just "oh, let the kernel decide in what order to write", and
use journaling. Data safety is a major concern there. Speed not so
So my thinking is to start the server in mode 1 and switch to mode 2
once the client sends a FUA/FLUSH request. That seems to be the only way
to detect that the client actually supports and uses FUA/FLUSH.
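
Roughly, the switch could look like the sketch below. This is only an
illustration, not code from any server: SyncPolicy, handle() and the "write"
/ "flush" strings are hypothetical names, and a whole-file os.fsync() stands
in for whatever finer-grained sync a real server would use for FUA.

```python
import os
import tempfile


class SyncPolicy:
    """Starts in mode 1 and switches to mode 2 on the first FUA/FLUSH."""

    def __init__(self):
        self.mode = 1

    def handle(self, fd, op, offset=0, data=b"", fua=False):
        """Apply one request to fd; return True if it was synced to disk."""
        # The first FLUSH or FUA shows that the client uses them, so from
        # now on plain writes no longer need a sync of their own.
        if op == "flush" or fua:
            self.mode = 2
        if op == "write":
            os.pwrite(fd, data, offset)
        must_sync = op == "flush" or fua or (op == "write" and self.mode == 1)
        if must_sync:
            os.fsync(fd)
        return must_sync
```

Usage would be along these lines:

```python
fd, path = tempfile.mkstemp()
policy = SyncPolicy()
policy.handle(fd, "write", 0, b"abcd")   # mode 1: write is synced
policy.handle(fd, "flush")               # client uses FLUSH: mode 2 from here
policy.handle(fd, "write", 4, b"efgh")   # mode 2: no sync for a plain write
os.close(fd)
os.remove(path)
```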