
Re: [Nbd] Design concept for async/multithreaded nbd-server


Comments inline.

--On 3 March 2012 00:46:47 +0100 Goswin von Brederlow <goswin-v-b@...186...> wrote:

There are multiple levels of async behaviour possible with more or less
improvement in speed and increase in risk of data loss:

1) Handle requests in parallel but wait for each request to complete
before replying. This would involve using fsync/sync_file_range/msync to
ensure data reaches the physical disk (the disk's write cache, actually)
before replying. This would be perfectly safe.

2) Handle requests in parallel but wait for each request to complete
before replying. But do not fsync/sync_file_range/msync unless required
by FUA or FLUSH. This would still be safe as long as the system does not
crash. An nbd-server crash would not result in data loss.

3) Handle requests in parallel and reply immediately on receiving write
requests. This would be the fastest but also involve the most risk. The
nbd-server would basically cache writes for a short while and an
nbd-server crash would lose that data. Error detection would also be
problematic since requests have already been acknowledged by the time a
write error occurs. The error would have to be transmitted in reply to
the next FLUSH request or as a new kind of packet. So this might go to

If the only client is linux (that's a big 'if'), or if the only specified
level of synchronous behaviour of the client is 'as per linux kernel'
(a rather smaller 'if'), then (3) is the way to go, as the linux block
ordering semantic is very simple. In essence, if multiple requests are
in flight, you can process and complete them in whatever order you want.
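
To illustrate that ordering rule, here is a minimal Python sketch. It is not
nbd-server code: `Request`, `handle_request` and `serve` are hypothetical
names, and the `cost` field is only a stand-in for backend latency. Several
requests are in flight at once, and each reply is emitted as soon as its
request finishes, identified by its handle rather than by arrival order.

```python
import concurrent.futures
import time
from dataclasses import dataclass


@dataclass
class Request:
    handle: int   # opaque id the client uses to match a reply to its request
    cost: float   # stand-in for how long the backend takes (assumption)


def handle_request(req: Request) -> int:
    # Pretend to do the I/O; a real server would read or write the backend here.
    time.sleep(req.cost)
    return req.handle


def serve(requests):
    """Process all in-flight requests in parallel; return handles in reply order."""
    replies = []
    workers = max(1, len(requests))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handle_request, r) for r in requests]
        # as_completed yields futures in completion order, not submission order.
        for fut in concurrent.futures.as_completed(futures):
            replies.append(fut.result())
    return replies


reply_order = serve([Request(1, 0.3), Request(2, 0.1), Request(3, 0.2)])
```

Every request gets exactly one reply, but `reply_order` need not match the
submission order, and a client must not rely on any particular order.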

Your 'risks' in (3) do not exist with a linux client, because the fs
layer (of a flush/fua compliant fs) will not issue requests that would
cause data loss without waiting for a reply to its flush/fua. Broken
filing systems (e.g. ext2) are inherently unsafe anyway: if you pull
the power cord, data may still be in your HD cache. Errored writes are lost.

The point of the barrier design is to push the understanding of risk up
to the FS layer, rather than handle it in the block device. I.e. if the FS
needs to know something has 'hit the magnetic media' it uses FUA (etc.),
and by not setting that, it expressly avoids slowing things down by
forcing out with FUA writes that do not need it. Same with flush.
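
As a rough sketch of what that means for a file-backed server (an assumption;
the `Backend` class and the `FLAG_FUA` value are hypothetical and not taken
from the NBD headers), the Python below syncs only when the request asks for
it. Note that `os.fsync` syncs the whole file, which is stronger than FUA
strictly needs, and that `os.pwrite` is only available on Unix-like systems.

```python
import os
import tempfile

# Hypothetical flag value for this sketch, not taken from the NBD protocol headers.
FLAG_FUA = 1


class Backend:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.sync_calls = 0  # counted so the cost of a policy can be observed

    def _sync(self):
        os.fsync(self.fd)
        self.sync_calls += 1

    def write(self, offset, data, flags=0):
        os.pwrite(self.fd, data, offset)
        # Only a write marked FUA has to be on stable storage before the reply.
        if flags & FLAG_FUA:
            self._sync()

    def flush(self):
        # A flush covers the writes already completed on this backend.
        self._sync()

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as tmp:
    backend = Backend(os.path.join(tmp, "export.img"))
    backend.write(0, b"a" * 512)                # plain write: no sync
    backend.write(512, b"b" * 512, FLAG_FUA)    # FUA write: sync before replying
    backend.flush()                             # explicit flush: sync
    syncs = backend.sync_calls
    size = os.fstat(backend.fd).st_size
    backend.close()
```

Three requests cost two syncs here; the plain write is left to the page cache
until the client asks for more.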

I have done stats on this (admittedly with a rather different backend)
and each of your proposals 1 and 2 is significantly slower than (3).
However, I'd suggest you code whatever you are doing with a flag to
implement this stuff (I did) so you can measure performance.

For completeness, there is an option (4): do everything in parallel
and ignore FLUSH and FUA completely. This goes even faster, but
is clearly unsafe.
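
One way to put the four options behind a single flag for benchmarking might
look like the Python sketch below. `SyncMode` and `must_sync` are hypothetical
names, and the sketch models only the sync decision: options 2 and 3 differ in
when the reply is sent, which is not shown here.

```python
from enum import Enum


class SyncMode(Enum):
    ALWAYS = 1       # option 1: sync every write before replying
    ON_REQUEST = 2   # option 2: sync only when FUA or FLUSH demands it
    EARLY_ACK = 3    # option 3: same sync rule as 2; reply timing differs
    IGNORE = 4       # option 4: never sync, clearly unsafe


def must_sync(mode: SyncMode, is_flush: bool, fua: bool) -> bool:
    """Decide whether this request needs a sync under the selected mode."""
    if mode is SyncMode.ALWAYS:
        return True
    if mode is SyncMode.IGNORE:
        return False
    return is_flush or fua
```

Running the same workload once per mode then gives directly comparable
timings.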

For various work related reasons, I think I should stay out of the
discussion on implementation. We are not doing the same thing (our
backend is not a file), but I don't want to wander into potentially
dangerous territory.

Alex Bligh
