
Re: [Nbd] Question about the expected behaviour of nbd-server for async ops



Wouter Verhelst <w@...112...> writes:

> On Sun, May 29, 2011 at 04:35:39PM +0200, Goswin von Brederlow wrote:
>> Wouter Verhelst <w@...112...> writes:
>> > I've been working on a multithreaded scatter/gather implementation of
>> > the backend; the current code is in a 'scatgat' branch (which needs
>> > updating for post-2.9.21 patches). It currently fails 'make check'
>> > though, so it's certainly not ready yet.
>> >
>> > That implementation, once complete, will read a request from the socket,
>> > check whether it is a read or write request and read the data in case of
>> > a write request, submit the request to a GThreadPool to read from or
>> > write to the backend storage, and use writev (which the man page tells
>> > me is supposed to be atomic) to write the entire reply to the socket at
>> > once (that is, the header plus the data in case of NBD_CMD_READ). So
>> > reads from the socket are done in the main thread, and writes are done
>> > from worker threads, and we don't need to use mutexes much, which is
>> > good.
>> 
>> Where do you still need mutexes?
>
> Dunno. Don't take me too literally, unless I'm writing a standards
> document.
>
>> > This would not care about ordering much, is a fairly simple
>> > implementation (once I get rid of the bugs), and should improve
>> > performance since reads and writes are no longer handled sequentially.
>> 
>> The other day on the train from work I put down some notes on how to do
>> nbd-server using splice for zero-copy operations, with very similar
>> results. The only difference is the writev part. The splice manpage says
>> nothing about splice() being atomic. So in pseudo code the server looks
>> like this:
>> 
>> Read thread:
>> 
>> 1) allocate a struct job { struct request *req; int pipe[2]; int res; }
>> 2) read a request from the socket into the job structure
>> 3) create pipe to hold data and add to job structure
>> 4) Add job to IO queue
>> 5) On WRITE: splice request->len bytes from socket to pipe
>> 6) On DISC: wait for all IO Threads to quit, add job to reply queue, quit
>> 7) goto 1
>> 
>> IO Threads:
>> 
>> 1) Wait for job in IO queue and remove it from the queue
>> On READ:
>> 2a) add job to reply queue
>> 3a) splice request->len bytes from disk to pipe
>> On WRITE:
>> 2b) splice request->len bytes from pipe to disk
>> 3b) add job to reply queue
>> On FLUSH/FUA:
>> 2c) fsync()
>> 3c) add job to reply queue
>> On DISC:
>> 2d) add job back to IO queue /* make all IO threads die */
>> 3d) quit
>> 4) goto 1
>> 
>> Write thread:
>> 
>> 1) Wait for job in reply queue and remove it from queue
>> 2) write reply to socket
>> 3) On READ splice request->len bytes from pipe to socket
>> 4) close pipe
>> 5) free job
>> 6) On DISC quit
>> 7) goto 1
>
> Sorry, but that just sounds overly complex. Compare:
>
> main thread:
>
> 1) Read socket into struct nbd_request (address of which is in first
>    member of an array of struct iov)
> 2) If disconnect: wait until all requests in the queue have been handled
>    (there's a GThreadPool API call for that), close up and exit.
> 3) If read: read data into second member of the struct iov array
> 4) put in another struct along with some data that we need to properly
>    handle the request, and add that struct to GThreadPool queue
>
> The GThreadPool API calls a function in a separate thread for every item
> in the queue. The (maximum) number of threads is configurable at
> creation time of the thread pool. So that function just needs to handle
> one item which it's been given, and exit. What that does, is this:
>
> 1) The function has a parameter that is a pointer to our second struct.
>    Cast that void*, dereference the struct iov array and the data that
>    we need, and cast the first member to the request packet
> 2) If read: read data from disk
> 3) If write: write data to disk
> 4) Use writev() to write the reply atomically to the client
>
> And there you are.

Which is basically the same as mine except for the writev().
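
For reference, I imagine the writev() reply for NBD_CMD_READ would look
roughly like this (untested sketch; the function name is made up, the
header layout is just the protocol's reply header, and short writes and
errors are ignored for brevity):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/uio.h>

    struct nbd_reply {
            uint32_t magic;          /* NBD_REPLY_MAGIC, 0x67446698 */
            uint32_t error;
            char     handle[8];
    };

    /* Header and data go out in one call, so replies from different
     * worker threads cannot interleave on the socket. */
    static void send_read_reply(int sock, const char handle[8],
                                void *data, size_t len)
    {
            struct nbd_reply reply;
            struct iovec iov[2];

            reply.magic = htonl(0x67446698);
            reply.error = 0;
            memcpy(reply.handle, handle, 8);

            iov[0].iov_base = &reply;
            iov[0].iov_len  = sizeof(reply);
            iov[1].iov_base = data;
            iov[1].iov_len  = len;

            writev(sock, iov, 2);
    }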

>> One sore point is that in the IO threads the READ request is put into the
>> reply queue before the data is read from disk.
>
> Why are you even putting data in a reply queue before you're ready to
> handle it? That sounds like a fatal flaw in your approach.

On a large read request the splice() from disk to pipe will block as soon
as the pipe (64 KiB by default) is full. By putting the read request into
the reply queue first, the write thread starts draining the pipe, which
allows the splice() to continue and eventually finish.
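
To make that concrete, here is roughly what the READ path would look like
in C (untested; struct job is the one from my pseudo code above,
queue_push()/queue_pop() are assumed helpers around something like a
GAsyncQueue, and splice_all() just loops over splice(2) because a single
call returns a short count once the pipe is full):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    struct request;                  /* the nbd request header      */
    struct job {                     /* as in the pseudo code above */
            struct request *req;
            int pipe[2];
            int res;
    };

    void queue_push(struct job *job);    /* assumed helpers, */
    struct job *queue_pop(void);         /* not shown        */

    /* Loop because one splice() moves at most what fits into the
     * pipe and then returns a short count. */
    static int splice_all(int in, loff_t *off_in, int out,
                          loff_t *off_out, size_t len)
    {
            while (len > 0) {
                    ssize_t n = splice(in, off_in, out, off_out, len,
                                       SPLICE_F_MOVE | SPLICE_F_MORE);
                    if (n <= 0)
                            return -1;
                    len -= n;
            }
            return 0;
    }

    /* IO thread, READ case: queue the job *before* filling the pipe. */
    static void io_read(struct job *job, int disk_fd, loff_t offset,
                        size_t len)
    {
            queue_push(job);                              /* step 2a */
            /* Blocks as soon as the pipe (64 KiB by default) is full,
             * i.e. until the write thread below starts draining it. */
            splice_all(disk_fd, &offset, job->pipe[1], NULL, len); /* 3a */
    }

    /* Write thread, READ case: reply header first, then drain the pipe. */
    static void write_read_reply(int sock, size_t len)
    {
            struct job *job = queue_pop();                /* step 1 */
            /* ... write the reply header to sock here ...   step 2 */
            splice_all(job->pipe[0], NULL, sock, NULL, len); /* step 3 */
            close(job->pipe[0]);
            close(job->pipe[1]);                          /* step 4 */
    }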


Which brings me to a little problem in your design:

Imagine what happens if a client requests a 1GB read. In my case the
splice() blocks until the write thread starts sending the data out to the
client. In your case you allocate 1GB of RAM. Oops.

I think you will have a problem dealing with large read/write requests
in your setup. Your design doesn't allow for the read/write to be done
in manageable chunks.
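
By "manageable chunks" I mean something like the following: stream the
data with a small fixed buffer instead of allocating request->len bytes
up front. Rough, untested sketch; send_reply_header() and the per-socket
lock are made up to illustrate the point, and note that it gives up the
single atomic writev():

    #include <pthread.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)

    void send_reply_header(int sock);   /* hypothetical helper */

    static int stream_read_reply(int sock, int disk_fd, off_t offset,
                                 size_t len, pthread_mutex_t *sock_lock)
    {
            char buf[CHUNK];

            /* The lock keeps other workers' replies from interleaving,
             * since the reply is no longer one atomic writev(). */
            pthread_mutex_lock(sock_lock);
            send_reply_header(sock);
            while (len > 0) {
                    size_t want = len < CHUNK ? len : CHUNK;
                    ssize_t n = pread(disk_fd, buf, want, offset);
                    if (n <= 0)
                            goto out;   /* error/EOF handling omitted */
                    if (write(sock, buf, n) != n)
                            goto out;   /* short writes not retried   */
                    offset += n;
                    len    -= n;
            }
    out:
            pthread_mutex_unlock(sock_lock);
            return len == 0 ? 0 : -1;
    }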

Regards
        Goswin


