[Nbd] Design concept for async/multithreaded nbd-server
I kind of want to just gather my thoughts on this and maybe use you as a
sounding board. Hopefully this might also come in handy if you rewrite
the nbd-server with threads.
The story so far:
The Linux kernel supports having multiple requests in the air and
handles out-of-order replies to requests. There is also a patch for the
nbd kernel module to support FLUSH/FUA/TRIM that would make true
asynchronous request handling safe.
The nbd-server is single threaded and uses synchronous IO. Each request
is completed and replied to before the next request is handled. That
means, for example, that a small read of a cached block has to wait for
a large write preceding it to complete. Also, write requests wait for
the write to actually complete before replying.
Where do we go from here?
There are multiple levels of async behaviour possible with more or less
improvement in speed and increase in risk of data loss:
1) Handle requests in parallel but wait for each request to complete
before replying. This would involve using fsync/sync_file_range/msync to
ensure data reaches the physical disk (the disk's write cache actually)
before replying. This would be perfectly safe.
2) Handle requests in parallel but wait for each request to complete
before replying. But do not fsync/sync_file_range/msync unless required
by FUA or FLUSH. This would still be safe as long as the system does not
crash. An nbd-server crash would not result in data loss.
3) Handle requests in parallel and reply immediately on receiving write
requests. This would be the fastest but also involve the most risk. The
nbd-server would basically cache writes for a short while and an
nbd-server crash would lose that data. Error detection would also be
problematic since requests have already been acknowledged by the time a
write error occurs. The error would have to be transmitted in reply to
the next FLUSH request or as a new kind of packet. So this might go to
Requests could also be handled out-of-order. A read request sent before
an overlapping write request could reply with the data of the write. I
do not believe that that would be correct behaviour. With multiple
client connections the order in which requests from different clients
are received is somewhat random. I would still serialize them in the
order in which they are received. A write from client A should not cut
in front of a read from client B.
Handling multiple requests in parallel means that there could be
overlaps between requests, especially if the server supports multiple
client connections. So some synchronization mechanism should be used.
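A minimal sketch of the overlap test such a mechanism would need,
assuming each request is described by a byte offset and a length. The
names struct nbd_range and ranges_overlap are made up for illustration,
and the sketch ignores integer overflow at the very end of the address
space:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request descriptor: a byte range on the export. */
struct nbd_range {
    uint64_t offset;  /* first byte touched */
    uint64_t length;  /* number of bytes, assumed > 0 */
};

/* Two half-open ranges [offset, offset + length) overlap if and only if
 * each one starts before the other one ends. */
static bool ranges_overlap(struct nbd_range a, struct nbd_range b)
{
    return a.offset < b.offset + b.length &&
           b.offset < a.offset + a.length;
}
```

Two requests that merely touch (one ends where the other starts) do not
count as overlapping here, so they could run in parallel.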
Idea 1: block/segment cache
Every request received is immediately entered into a block or segment
cache in a linear/serialized way. A request could be split up into
fixed-size blocks (e.g. blocksize-sized chunks) and added to a simple
hashtable. Or a (balanced) tree of segments could be used to store
In both cases a new write request would replace existing write entries
if they have not yet started their IO. Entries that have started their
IO have to be shadowed. That way only the latest data is visible.
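To make the block variant concrete, here is a rough sketch of how a
request could be cut into per-block pieces, each keyed by its block
index for the hashtable. BLOCK_SIZE, struct piece and split_request are
assumptions for this example, not existing nbd-server code:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u   /* assumed cache block size */

/* One piece of a request that falls inside a single cache block. */
struct piece {
    uint64_t block;     /* block index, usable as the hashtable key */
    uint32_t in_block;  /* offset of the piece inside that block */
    uint32_t length;    /* bytes of the request inside that block */
};

/* Split [offset, offset + length) into per-block pieces. Stores at most
 * max pieces in out and returns how many pieces the request needs in
 * total (which may be more than max). */
static size_t split_request(uint64_t offset, uint64_t length,
                            struct piece *out, size_t max)
{
    size_t n = 0;

    while (length > 0) {
        uint32_t in_block = (uint32_t)(offset % BLOCK_SIZE);
        uint64_t room = BLOCK_SIZE - in_block;   /* bytes left in block */
        uint32_t len = (uint32_t)(length < room ? length : room);

        if (n < max) {
            out[n].block = offset / BLOCK_SIZE;
            out[n].in_block = in_block;
            out[n].length = len;
        }
        n++;
        offset += len;
        length -= len;
    }
    return n;
}
```

Note that the first and last piece of an unaligned request cover only
part of a block, so a cache entry would have to cope with partially
written blocks in some way.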
Read requests would also be entered in the cache. If the cache already
has the data for the request then it can be used directly. If the cache
has an entry but no data yet then the request needs to add itself to the
existing entry, to be called back when the data arrives. Otherwise a new
entry has to be created and IO started. Note that overlaps can be
partial so a large read request might have bits and pieces in cache and
others missing. Read requests are always shadowed by write requests so
that the read can still return the old data (the data at the time the
read was started).
I think splitting requests into blocks and using a hashtable is by far
the simpler solution. With a tree of segments, overlaps would require
splitting segments for which IO is already running. Balancing is also
Idea 2: IO queues
The block/segment cache takes care of synchronizing our requests. But the
data will have to move from there to the disk in some way.
When using threads I would make a queue of all pending IO. Each IO
thread would wait on the queue, when woken take the first entry,
perform the read or write specified and run the callbacks for the entry
when done. Alternatively there could be a result queue where it dumps
the entry and the callback could be handled in different threads.
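The threaded variant could look roughly like the following sketch,
which uses a pthread mutex and condition variable for the queue. All
names (io_entry, io_queue, io_worker, the demo_* helpers) are invented
for this example, and the callbacks run in the IO thread itself rather
than via a separate result queue:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical queue entry: the IO to perform plus its callback. */
struct io_entry {
    struct io_entry *next;
    void (*perform)(struct io_entry *);  /* the blocking pread/pwrite */
    void (*done)(struct io_entry *);     /* runs once the IO finished */
};

struct io_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct io_entry *head, *tail;
    int shutdown;                        /* lets idle workers exit */
};

static void io_queue_init(struct io_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
    q->shutdown = 0;
}

static void io_queue_push(struct io_queue *q, struct io_entry *e)
{
    e->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = e; else q->head = e;
    q->tail = e;
    pthread_cond_signal(&q->nonempty);   /* wake one waiting IO thread */
    pthread_mutex_unlock(&q->lock);
}

static void io_queue_shutdown(struct io_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->shutdown = 1;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Body of each IO thread: take the first entry, perform it, call back. */
static void *io_worker(void *arg)
{
    struct io_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (!q->head && !q->shutdown)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct io_entry *e = q->head;
        if (!e) {                        /* empty and shutting down */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        q->head = e->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        e->perform(e);                   /* blocking IO outside the lock */
        e->done(e);
    }
}

/* Usage demo: push n no-op entries through two workers and return how
 * many completion callbacks ran. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_completed;

static void demo_perform(struct io_entry *e) { (void)e; }

static void demo_done(struct io_entry *e)
{
    (void)e;
    pthread_mutex_lock(&demo_lock);
    demo_completed++;
    pthread_mutex_unlock(&demo_lock);
}

static int io_queue_demo(int n)
{
    struct io_queue q;
    struct io_entry entries[32];
    pthread_t workers[2];

    if (n > 32) n = 32;
    demo_completed = 0;
    io_queue_init(&q);
    for (int i = 0; i < 2; i++)
        pthread_create(&workers[i], NULL, io_worker, &q);
    for (int i = 0; i < n; i++) {
        entries[i].perform = demo_perform;
        entries[i].done = demo_done;
        io_queue_push(&q, &entries[i]);
    }
    io_queue_shutdown(&q);               /* workers drain the queue first */
    for (int i = 0; i < 2; i++)
        pthread_join(workers[i], NULL);
    return demo_completed;
}
```

Because the workers only exit once the queue is empty, shutting down
after the last push still processes every entry.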
When using POSIX AIO or Linux AIO the IO could be started immediately upon
entering the block/segment cache as they internally already queue IO and
do not block.
In both cases the server would notice when the IO has completed. If a
request was split into multiple IOs then it needs to wait for all of
them to finish. Then the reply can be sent.
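Waiting for all parts of a split request can be done with a simple
countdown, sketched here with C11 atomics so that it also works when
the completions arrive from several IO threads. The struct request
layout and the function names are assumptions for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-request state shared by all IOs it was split into. */
struct request {
    atomic_int pending;  /* IOs that have not completed yet */
    atomic_int error;    /* first error code seen, 0 if none */
};

static void request_start(struct request *r, int n_ios)
{
    atomic_init(&r->pending, n_ios);
    atomic_init(&r->error, 0);
}

/* Called from each IO's completion callback. Returns true for exactly
 * one caller: the one that finished the last IO and has to send the
 * reply. */
static bool request_io_done(struct request *r, int err)
{
    if (err) {
        int expected = 0;   /* only the first error is recorded */
        atomic_compare_exchange_strong(&r->error, &expected, err);
    }
    return atomic_fetch_sub(&r->pending, 1) == 1;
}
```

The caller that gets true back would then send the reply, with the
recorded error if any of the partial IOs failed.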
A FUA request could use either O_SYNC (open the disk twice, once with
O_SYNC and once without) or use sync_file_range/msync to force the data
onto the disk.
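A sketch of the "open the disk twice" variant, assuming the export is a
regular file or block device that can be opened by path. struct export,
export_open and export_write are hypothetical names, and EINTR handling
is left out to keep it short:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical export handle: the same file opened twice. */
struct export {
    int fd;       /* normal writes that may sit in the page cache */
    int fd_sync;  /* O_SYNC: pwrite returns after the data was synced */
};

static int export_open(struct export *ex, const char *path)
{
    ex->fd = open(path, O_RDWR);
    if (ex->fd < 0)
        return -1;
    ex->fd_sync = open(path, O_RDWR | O_SYNC);
    if (ex->fd_sync < 0) {
        close(ex->fd);
        return -1;
    }
    return 0;
}

/* Write len bytes at off. FUA writes go through the O_SYNC descriptor,
 * everything else through the normal one. Returns 0 on success, -1 on
 * error with errno set by pwrite. */
static int export_write(struct export *ex, const void *buf, size_t len,
                        off_t off, bool fua)
{
    int fd = fua ? ex->fd_sync : ex->fd;
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Both descriptors go through the same page cache, so a later read on
either one sees the data of a FUA write.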
A FLUSH request would need to wait for all queued requests and running
IO to complete before sending a reply. This could be done in one of three
ways:
- stop reading from all sockets
- wait for the queue to drain
- wait for all IO threads to finish
- start reading from all sockets again
2) barrier (only when using an IO queue)
- Set up a spin-lock primed with the number of IO threads
- insert a barrier token in the IO queue
- upon hitting the barrier IO threads
+ do not remove the barrier from the queue
+ decrement the spin-lock
+ go to sleep
- wait for the spin-lock to reach 0
- remove the barrier from the IO queue and wake up IO threads
This would allow read requests of cache data to complete but would
block all write requests and uncached reads.
Instead of blocking operations a FLUSH request could add itself to
all running and queued write requests as callback (record their
number). Once all of them have completed the FLUSH can reply. This
way a flush would not block operations. It would not act as a
barrier. Another client (or even the client issuing FLUSH) could
still issue requests. The drawback would be that the flush could take
longer because other IO requests are scheduled first by the kernel.
Note: In a threaded implementation it might be tricky to add
callbacks on the fly like that.
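One way to get the effect of the callback approach without attaching
anything to individual writes is to number the writes and let each
FLUSH remember which numbers it has to wait for. This is a
single-threaded sketch under that assumption (a threaded server would
need a lock around the tracker); struct tracker, struct flush and the
function names are invented for the example:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct flush {
    struct flush *next;   /* list of FLUSHes still waiting */
    uint64_t barrier_id;  /* waits for writes with id < barrier_id */
    int outstanding;      /* how many of those are still running */
    bool done;            /* set once the FLUSH may be answered */
};

struct tracker {
    uint64_t next_id;       /* id handed to the next write */
    int in_flight;          /* writes queued or running */
    struct flush *pending;
};

/* Register a new write; the returned id is passed to write_done(). */
static uint64_t write_begin(struct tracker *t)
{
    t->in_flight++;
    return t->next_id++;
}

/* Register a FLUSH. Returns true if nothing is in flight, so the FLUSH
 * can be handled right away. Later writes are not waited for. */
static bool flush_begin(struct tracker *t, struct flush *f)
{
    f->barrier_id = t->next_id;
    f->outstanding = t->in_flight;
    f->done = (f->outstanding == 0);
    if (!f->done) {
        f->next = t->pending;
        t->pending = f;
    }
    return f->done;
}

/* A write completed. Returns how many FLUSHes became ready; a real
 * server would fsync/fdatasync and send their replies at this point. */
static int write_done(struct tracker *t, uint64_t id)
{
    int completed = 0;
    struct flush **pp = &t->pending;

    t->in_flight--;
    while (*pp) {
        struct flush *f = *pp;
        if (id < f->barrier_id && --f->outstanding == 0) {
            *pp = f->next;   /* unlink the finished FLUSH */
            f->done = true;
            completed++;
        } else {
            pp = &f->next;
        }
    }
    return completed;
}
```

Writes that arrive after the FLUSH get an id at or above its
barrier_id, so they neither delay the FLUSH nor are delayed by it.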
So what do you think?
NOTE: When using a single thread and AIO one could also rely on the
serialization and caching the kernel does for IO and simply start each
request in the order they come in and reply when the callback is
triggered. This actually makes for a really simple implementation but
means that the behaviour of the nbd-server would be dictated by the
behaviour of the underlying AIO layer. I'm not sure if that would result
in a consistent behaviour across platforms.
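For reference, a rough sketch of what the simple variant looks like
with POSIX AIO. The helper names start_write and wait_for are made up,
and the blocking wait is only there to keep the example short; a real
single-threaded server would presumably ask for a completion
notification through the aiocb's aio_sigevent field instead of waiting:

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Start an asynchronous write of len bytes at off. The control block
 * and the buffer must stay valid until the operation has completed. */
static int start_write(struct aiocb *cb, int fd, const void *buf,
                       size_t len, off_t off)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = (void *)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    return aio_write(cb);   /* queues the IO and returns immediately */
}

/* Block until the operation has finished. Returns the number of bytes
 * transferred, or -1 with errno set to the error of the operation. */
static ssize_t wait_for(struct aiocb *cb)
{
    const struct aiocb *const list[1] = { cb };
    int err;
    ssize_t n;

    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    err = aio_error(cb);
    n = aio_return(cb);     /* must be called once per operation */
    if (err) {
        errno = err;
        return -1;
    }
    return n;
}
```

Several writes can be started back to back this way and then be
collected in any order, which is where the dependence on the ordering
behaviour of the AIO layer comes in.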
NOTE2: When using threads the nbd-server could use splice instead of
read/write under Linux to get zero-copy behaviour. That might be a good
reason to prefer a more complicated solution with cache and queue over
simply calling AIO.