
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



On Mon, May 30, 2011 at 12:29:44PM +0200, Goswin von Brederlow wrote:
> Wouter Verhelst <w@...112...> writes:
> 
> > On Sun, May 29, 2011 at 04:35:39PM +0200, Goswin von Brederlow wrote:
> >> Wouter Verhelst <w@...112...> writes:
> >> > I've been working on a multithreaded scatter/gather implementation of
> >> > the backend; the current code is in a 'scatgat' branch (which needs
> >> > updating for post-2.9.21 patches). It currently fails 'make check'
> >> > though, so it's certainly not ready yet.
> >> >
> >> > That implementation, once complete, will read a request from the socket,
> >> > check whether it is a read or write request and read the data in case of
> >> > a write request, submit the request to a GThreadPool to read from or
> >> > write to the backend storage, and use writev (which the man page tells
> >> > me is supposed to be atomic) to write the entire reply to the socket at
> >> > once (that is, the header plus the data in case of NBD_CMD_READ). So
> >> > reads from the socket are done in the main thread, and writes are done
> >> > from worker threads, and we don't need to use mutexes much, which is
> >> > good.
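
To illustrate, a minimal sketch of such a writev()-based reply. The struct and
function names are invented for the example, not taken from the nbd-server
tree; the old-style reply header is 4 bytes magic, 4 bytes error, 8 bytes
handle:

/* Sketch only: send an NBD reply header plus, for NBD_CMD_READ, the data
 * in a single writev() call, so replies from different worker threads
 * cannot interleave on the socket.  Names are illustrative. */
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

struct reply_hdr {
        uint32_t magic;     /* NBD reply magic, network byte order */
        uint32_t error;     /* 0 on success, errno-style value otherwise */
        char     handle[8]; /* copied verbatim from the request */
};

ssize_t send_reply(int sock, const char handle[8], uint32_t error,
                   const void *data, size_t len)
{
        struct reply_hdr hdr;
        struct iovec iov[2];

        hdr.magic = htonl(0x67446698);
        hdr.error = htonl(error);
        memcpy(hdr.handle, handle, sizeof(hdr.handle));

        iov[0].iov_base = &hdr;
        iov[0].iov_len  = sizeof(hdr);
        iov[1].iov_base = (void *)data;
        iov[1].iov_len  = data ? len : 0;

        /* one call per reply, header and payload together */
        return writev(sock, iov, data ? 2 : 1);
}

A worker thread would call this once per request, with data == NULL for
anything but a successful NBD_CMD_READ.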
> >> 
> >> Where do you still need mutexes?
> >
> > Dunno. Don't take me too literally, unless I'm writing a standards
> > document.
> >
> >> > This would not care about ordering much, is a fairly simple
> >> > implementation (once I get rid of the bugs), and should improve
> >> > performance since reads and writes are no longer handled sequentially.
> >> 
> >> The other day on the train from work I put down some notes on how to do
> >> nbd-server using splice for zero-copy operations, with very similar
> >> results. The only difference is the writev part. The splice manpage says
> >> nothing about splice() being atomic. So in pseudocode the server looks
> >> like this:
> >> 
> >> Read thread:
> >> 
> >> 1) allocate a struct job { struct request *req; int pipe[2]; int res; }
> >> 2) read in a request from the socket into the job structure
> >> 3) create pipe to hold data and add to job structure
> >> 4) Add job to IO queue
> >> 5) On WRITE: splice request->len bytes from socket to pipe
> >> 6) On DISC: wait for all IO Threads to quit, add job to reply queue, quit
> >> 7) goto 1
> >> 
> >> IO Threads:
> >> 
> >> 1) Wait for job in IO queue and remove it from the queue
> >> On READ:
> >> 2a) add job to reply queue
> >> 3a) splice request->len bytes from disk to pipe
> >> On WRITE:
> >> 2b) splice request->len bytes from pipe to disk
> >> 3b) add job to reply queue
> >> On FLUSH/FUA:
> >> 2c) fsync()
> >> 3c) add job to reply queue
> >> On DISC:
> >> 2d) add job back to IO queue /* make all IO threads die */
> >> 3d) quit
> >> 4) goto 1
> >> 
> >> Write thread:
> >> 
> >> 1) Wait for job in reply queue and remove it from queue
> >> 2) write reply to socket
> >> 3) On READ: splice request->len bytes from pipe to socket
> >> 4) close pipe
> >> 5) free job
> >> 6) On DISC quit
> >> 7) goto 1
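
To make that concrete, the WRITE path of such a splice()-based scheme could
look roughly like this in C. This is a sketch only: the names are invented,
and short splices and errors are handled only minimally:

/* Sketch of the splice()-based WRITE path above: the read thread splices
 * the payload from the socket into the job's pipe, an IO thread later
 * splices it from the pipe to the backend file.  Invented names; real
 * code needs proper error and partial-transfer handling. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

struct nbd_request;                /* parsed request header */

struct job {
        struct nbd_request *req;
        int pipe[2];               /* holds the in-flight payload */
        int res;                   /* errno-style result */
};

/* read thread, step 5: socket -> pipe */
static int splice_from_socket(int sock, struct job *j, size_t len)
{
        while (len > 0) {
                ssize_t n = splice(sock, NULL, j->pipe[1], NULL, len,
                                   SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                        return -1;
                len -= n;
        }
        return 0;
}

/* IO thread, step 2b: pipe -> backend file at the requested offset */
static int splice_to_disk(struct job *j, int fd, loff_t off, size_t len)
{
        while (len > 0) {
                ssize_t n = splice(j->pipe[0], NULL, fd, &off, len,
                                   SPLICE_F_MOVE);
                if (n <= 0)
                        return -1;
                len -= n;
        }
        return 0;
}

The READ path is the mirror image: the IO thread splices from the backend
file into the pipe, and the write thread splices from the pipe to the socket.
A pipe typically buffers only 64KB, which is why the two halves have to run
concurrently for large requests (as discussed further down).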
> >
> > Sorry, but that just sounds overly complex. Compare:
> >
> > main thread:
> >
> > 1) Read socket into struct nbd_request (address of which is in first
> >    member of an array of struct iovec)
> > 2) If disconnect: wait until all requests in the queue have been handled
> >    (there's a GThreadPool API call for that), close up and exit.
> > 3) If write: read the data from the socket into the second member of
> >    the struct iovec array
> > 4) put in another struct along with some data that we need to properly
> >    handle the request, and add that struct to GThreadPool queue
> >
> > The GThreadPool API calls a function in a separate thread for every item
> > in the queue. The (maximum) number of threads is configurable at
> > creation time of the thread pool. So that function just needs to handle
> > one item which it's been given, and exit. What that does, is this:
> >
> > 1) The function has a parameter that is a pointer to our second struct.
> >    Cast that void*, dereference the struct iovec array and the data that
> >    we need, and cast the first member to the request packet
> > 2) If read: read data from disk
> > 3) If write: write data to disk
> > 4) Use writev() to write the reply atomically to the client
> >
> > And there you are.
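
A rough sketch of that dispatch, assuming GLib's thread pool. Only the
g_thread_pool_* calls are real GLib API; the work_item struct and the
read_one_request() helper are made up for the example:

/* Sketch of the main-thread/worker split described above, using GLib's
 * GThreadPool. */
#include <glib.h>
#include <sys/uio.h>

struct work_item {
        struct iovec iov[2];  /* [0] = reply header, [1] = data (if any) */
        int sock;             /* client socket */
        int fd;               /* backend file descriptor */
        /* ... whatever else is needed to handle the request ... */
};

/* hypothetical helper: steps 1-4 above, returning NULL on disconnect */
struct work_item *read_one_request(int sock, int fd);

/* called by the pool, once per queued item, in a worker thread */
static void handle_request(gpointer data, gpointer user_data)
{
        struct work_item *w = data;
        (void)user_data;      /* unused */

        /* read from or write to the backend here, fill in the reply
         * header, then send it (plus the data for NBD_CMD_READ) with a
         * single writev() */
        writev(w->sock, w->iov, 2);
        g_free(w);
}

void serve(int sock, int fd)
{
        GThreadPool *pool = g_thread_pool_new(handle_request, NULL,
                                              4 /* max threads */, FALSE, NULL);
        struct work_item *w;

        while ((w = read_one_request(sock, fd)) != NULL)
                g_thread_pool_push(pool, w, NULL);

        /* disconnect: wait until every queued request has been handled */
        g_thread_pool_free(pool, FALSE, TRUE);
}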
> 
> Which is basically the same as mine except for the writev().
> 
> >> One sore point is that in IO Threads the READ request is put into the
> >> reply queue before the data is read from disk.
> >
> > Why are you even putting data in a reply queue before you're ready to
> > handle it? That sounds like a fatal flaw to your approach.
> 
> On a large read request the splice() call will block because the pipe
> will be full. By putting the read request into the reply queue early,
> the write thread will drain the pipe, allowing the splice() to continue
> and eventually finish.
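
In code, the write thread's READ path would look roughly like this,
continuing the earlier splice sketch; send_reply_header() is a made-up
helper and error handling is elided:

/* Sketch: the write thread sends the reply header first and then drains
 * the job's pipe into the socket, possibly while the IO thread is still
 * splicing data from disk into the other end of the pipe. */
static void send_reply_header(int sock, struct job *j);   /* hypothetical */

static void send_read_reply(int sock, struct job *j, size_t len)
{
        send_reply_header(sock, j);      /* 16-byte old-style reply header */

        while (len > 0) {
                ssize_t n = splice(j->pipe[0], NULL, sock, NULL, len,
                                   SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                        break;           /* real code: handle the error */
                len -= n;
        }
        close(j->pipe[0]);
        close(j->pipe[1]);
}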

Ah, that way. Right.

> Which brings me to a little problem in your design:
> 
> Imagine what happens if a client requests a 1GB read. In my case the
> splice() blocks till the write thread starts sending out the data to the
> client. In your case you allocate 1GB of RAM. Oops.

If we run out of memory, in my case I'll just return an error. There's
nothing in the protocol that specifies I have to support 1GB read
requests. Indeed, for the longest time the largest read request that we
supported in practice was about 1MB, yet the number of times that this
produced an error was very low (the bug was fixed because one
high-volume user of nbd saw nbd-server crashing about once a month).

I don't think that not supporting extremely large requests is going to
be a problem in practice, provided we return a proper error code.
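
For what it's worth, the check itself is trivial -- something along these
lines, where MAX_REQUEST_SIZE is an arbitrary cap, the abbreviated request
struct is illustrative only, and send_reply() is the helper from the writev
sketch above:

/* Sketch: reject an oversized request with an error reply instead of
 * trying to buffer it. */
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

#define MAX_REQUEST_SIZE (32u * 1024 * 1024)   /* arbitrary cap */

struct nbd_request {          /* abbreviated: only the fields used here */
        char     handle[8];
        uint32_t len;         /* network byte order on the wire */
};

ssize_t send_reply(int sock, const char handle[8], uint32_t error,
                   const void *data, size_t len);   /* from the sketch above */

/* returns 0 if the request is acceptable, -1 after sending an error */
static int check_request_size(int sock, const struct nbd_request *req)
{
        if (ntohl(req->len) > MAX_REQUEST_SIZE) {
                send_reply(sock, req->handle, EINVAL, NULL, 0);
                return -1;
        }
        return 0;
}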

> I think you will have a problem dealing with large read/write requests
> in your setup. Your design doesn't allow for the read/write to be done
> in managable chunks.

There will indeed be an issue if the request is insanely large. But with
modern hardware, serving several requests of up to several tens of
megabytes in parallel should not be an issue, even with that design. And
even 'several tens of megabytes' is not going to be seen in the real
world, except by malicious clients (and I don't mind not doing what
malicious clients expect me to do...).

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula:

pi zz a


