
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



On Sun, May 29, 2011 at 04:35:39PM +0200, Goswin von Brederlow wrote:
> Wouter Verhelst <w@...112...> writes:
> > I've been working on a multithreaded scatter/gather implementation of
> > the backend; the current code is in a 'scatgat' branch (which needs
> > updating for post-2.9.21 patches). It currently fails 'make check'
> > though, so it's certainly not ready yet.
> >
> > That implementation, once complete, will read a request from the socket,
> > check whether it is a read or write request and read the data in case of
> > a write request, submit the request to a GThreadPool to read from or
> > write to the backend storage, and use writev (which the man page tells
> > me is supposed to be atomic) to write the entire reply to the socket at
> > once (that is, the header plus the data in case of NBD_CMD_READ). So
> > reads from the socket are done in the main thread, and writes are done
> > from worker threads, and we don't need to use mutexes much, which is
> > good.
> 
> Where do you still need mutexes?

Dunno. Don't take me too literally, unless I'm writing a standards
document.
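
To make the writev part quoted above concrete: the reply path boils down
to something like this (just a sketch, not the scatgat code; it leans on
<linux/nbd.h> for the wire structures, and error handling is left out):

/* Sketch: send the reply header plus (for NBD_CMD_READ) the data in a
 * single writev() call. Not the scatgat code; error handling omitted. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/uio.h>
#include <linux/nbd.h>          /* struct nbd_reply, NBD_REPLY_MAGIC */

static ssize_t send_reply(int sock, const char handle[8], uint32_t error,
                          const void *data, size_t datalen)
{
        struct nbd_reply rep;
        struct iovec iov[2];
        int iovcnt = 1;

        rep.magic = htonl(NBD_REPLY_MAGIC);
        rep.error = htonl(error);
        memcpy(rep.handle, handle, sizeof(rep.handle));

        iov[0].iov_base = &rep;
        iov[0].iov_len = sizeof(rep);
        if (data != NULL && datalen > 0) {
                /* NBD_CMD_READ: the payload follows the header directly */
                iov[1].iov_base = (void *)data;
                iov[1].iov_len = datalen;
                iovcnt = 2;
        }
        return writev(sock, iov, iovcnt);
}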

> > This would not care about ordering much, is a fairly simple
> > implementation (once I get rid of the bugs), and should improve
> > performance since reads and writes are no longer handled sequentially.
> 
> The other day on the train from work I put down some notes on how to do
> nbd-server using splice for zero-copy operations, with very similar
> results. The only difference is the writev part. The splice manpage says
> nothing about splice() being atomic. So in pseudocode the server looks
> like this:
> 
> Read thread:
> 
> 1) allocate a struct job { struct request *req; int pipe[2]; int res; }
> 2) read a request from the socket into the job structure
> 3) create pipe to hold data and add to job structure
> 4) Add job to IO queue
> 5) On WRITE: splice request->len bytes from socket to pipe
> 6) On DISC: wait for all IO Threads to quit, add job to reply queue, quit
> 7) goto 1
> 
> IO Threads:
> 
> 1) Wait for job in IO queue and remove it from the queue
> On READ:
> 2a) add job to reply queue
> 3a) splice request->len bytes from disk to pipe
> On WRITE:
> 2b) splice request->len bytes from pipe to disk
> 3b) add job to reply queue
> On FLUSH/FUA:
> 2c) fsync()
> 3c) add job to reply queue
> On DISC:
> 2d) add job back to IO queue /* make all IO threads die */
> 3d) quit
> 4) goto 1
> 
> Write thread:
> 
> 1) Wait for job in reply queue and remove it from queue
> 2) write reply to socket
> 3) On READ: splice request->len bytes from pipe to socket
> 4) close pipe
> 5) free job
> 6) On DISC quit
> 7) goto 1
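
For reference, the splice() legs of those steps would look roughly like
this (a sketch only: struct job follows your pseudocode, the sock/disk
fields and everything else are assumptions on my part, and since splice()
can return short counts the calls have to loop):

/* Sketch of the splice() transfers in the pseudocode above; not working
 * code, error handling and EINTR retries omitted. One end of each splice()
 * must be a pipe, and it may move fewer bytes than asked, hence the loops. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <linux/nbd.h>

struct job {
        struct nbd_request *req;  /* "struct request" in the pseudocode */
        int    pipe[2];           /* holds the request/reply payload */
        int    res;
        int    sock;              /* client socket (assumed field) */
        int    disk;              /* backend fd (assumed field) */
};

/* Read thread, WRITE request: payload from the socket into the pipe. */
static int splice_in(struct job *j, size_t len)
{
        while (len > 0) {
                ssize_t n = splice(j->sock, NULL, j->pipe[1], NULL,
                                   len, SPLICE_F_MOVE);
                if (n <= 0)
                        return -1;
                len -= n;
        }
        return 0;
}

/* IO thread, WRITE request: payload from the pipe onto the disk. */
static int splice_out(struct job *j, loff_t off, size_t len)
{
        while (len > 0) {
                ssize_t n = splice(j->pipe[0], NULL, j->disk, &off,
                                   len, SPLICE_F_MOVE);
                if (n <= 0)
                        return -1;
                len -= n;
        }
        return 0;
}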

Sorry, but that just sounds overly complex. Compare:

main thread:

1) Read the socket into a struct nbd_request (the address of which is
   stored in the first member of an array of struct iovec)
2) If disconnect: wait until all requests in the queue have been handled
   (there's a GThreadPool API call for that), close up and exit.
3) If write: read the data from the socket into the second member of the
   struct iovec array
4) Put that in another struct, along with some data that we need to
   properly handle the request, and add that struct to the GThreadPool
   queue

The GThreadPool API calls a function in a separate thread for every item
in the queue. The (maximum) number of threads is configurable at
creation time of the thread pool. So that function just needs to handle
the one item it's been given, and exit. What it does is this:

1) The function has a parameter that is a pointer to our second struct.
   Cast that void*, dereference the struct iovec array and the data that
   we need, and interpret the first member as the request packet
2) If read: read data from disk
3) If write: write data to disk
4) Use writev() to write the reply atomically to the client

And there you are.
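
In GLib terms, and only as a sketch (struct work and its fields are
invented for the example, error handling is omitted), that flow looks
more or less like this:

/* Sketch of the thread-pool flow described above, against GLib's
 * GThreadPool API. struct work is invented for this example. */
#include <glib.h>
#include <string.h>
#include <unistd.h>
#include <endian.h>
#include <arpa/inet.h>
#include <sys/uio.h>
#include <linux/nbd.h>          /* struct nbd_request/nbd_reply, NBD_CMD_* */

struct work {
        int                sock;  /* client socket */
        int                disk;  /* backend file descriptor */
        struct nbd_request req;   /* header read by the main thread */
        char              *buf;   /* write payload (already read) or read target */
};

/* Called by the pool, in a worker thread, once per pushed item. */
static void handle_request(gpointer data, gpointer user_data)
{
        struct work *w = data;
        guint32 type = ntohl(w->req.type);
        guint32 len = ntohl(w->req.len);
        off_t   off = be64toh(w->req.from);
        struct nbd_reply rep;
        struct iovec iov[2];

        if (type == NBD_CMD_READ)
                pread(w->disk, w->buf, len, off);
        else if (type == NBD_CMD_WRITE)
                pwrite(w->disk, w->buf, len, off);

        rep.magic = htonl(NBD_REPLY_MAGIC);
        rep.error = 0;
        memcpy(rep.handle, w->req.handle, sizeof(rep.handle));

        iov[0].iov_base = &rep;
        iov[0].iov_len = sizeof(rep);
        iov[1].iov_base = w->buf;
        iov[1].iov_len = (type == NBD_CMD_READ) ? len : 0;
        writev(w->sock, iov, 2);   /* header + data go out as one unit */

        g_free(w->buf);
        g_free(w);
}

/* Main thread: one pool at startup, one push per request. On NBD_CMD_DISC,
 * g_thread_pool_free(pool, FALSE, TRUE) waits for the queued work to drain. */
static GThreadPool *start_pool(void)
{
        return g_thread_pool_new(handle_request, NULL,
                                 8 /* max worker threads */, FALSE, NULL);
}

The nice part is that the only shared state is the pool's internal queue,
which GLib already serialises for us, so there's hardly any explicit
locking left.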

> One sore point is that in IO Threads the READ request is put into the
> reply queue before the data is read from disk.

Why are you even putting a request in the reply queue before you're ready
to handle it? That sounds like a fatal flaw in your approach.

> For a multi-disk server there would also be one master that listens on
> the NBD port. On startup it would start the IO threads. On connect it
> would start a read+write thread for the requested device and
> NBD_CMD_DISC would skip the IO queue and go directly to the reply queue
> so the IO threads don't die.

Eventually I plan on doing something like that, yes. But currently the
code assumes 'if the client misbehaves, I can just die and our
conversation is over' in too many places.

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula:

pi zz a


