
Re: [Nbd] Bug with large reads and protocol issue



Alex Bligh <alex@...872...> writes:

> I have found an interesting problem with large reads.
>
> I have been trying to ascertain what the correct protocol is
> for read errors.
>
> What nbd-server currently does is process the read in chunks
> of BUF_SIZE. If any chunk errors, it sends an error
> response. This is problematic because the client cannot
> correctly process an error response if it is sent half-way
> through a stream of data blocks. It causes the connection
> to hang. As the error code may be interpreted as data,
> which might be acted upon, it is theoretically possible that
> this might cause corruption (though this is unlikely
> with the current client as the error response is so
> much smaller than a block).
>
> Reading the protocol, there is only one possible interpretation
> of what is meant to happen (as far as I can tell). Either
> the response is meant to error, in which case no data is
> sent at all, or the response does not error, in which case
> all the data is meant to be sent. There is (rightly) no
> "send half the data and an error" variant.
>
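
For reference, the reply header looks roughly like this (a sketch
after the kernel's <linux/nbd.h>, not the authoritative definition):

```c
#include <stdint.h>

/* Sketch of the NBD simple reply header, after <linux/nbd.h>.
 * For NBD_CMD_READ the payload follows this header directly,
 * but only when error == 0 -- there is no per-chunk framing,
 * so an error can only be signalled before any data is sent. */
struct nbd_reply {
    uint32_t magic;     /* NBD_REPLY_MAGIC (0x67446698), network order */
    uint32_t error;     /* 0 on success, errno-style code otherwise */
    char     handle[8]; /* opaque copy of the request handle */
};
```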
> But this is really problematic for the reason set out below.
>
> Let's suppose that a given server can handle large reads
> efficiently. What I want to do is to start sending the data to the
> tcp channel before I've read all the data. This is in fact
> what nbd-server attempts to do right now when the read is
> bigger than BUF_SIZE.

Does that really make a difference past e.g. 128k chunks if pipelining
is allowed? 1MB chunks? 10MB chunks? I think the best option here is
for the server to communicate a maximum request size to the client,
with the client sticking to that.
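
On the client side that would look something like this (a sketch;
max_request and issue_read are invented names, not part of the
current protocol):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: the client splits a large read into
 * requests no bigger than the maximum the server advertised
 * during negotiation. */
static uint32_t max_request = 128 * 1024;   /* e.g. 128k chunks */
static int requests_issued;

static void issue_read(uint64_t offset, uint32_t len)
{
    requests_issued++;
    printf("READ offset=%llu len=%u\n", (unsigned long long)offset, len);
}

static void split_read(uint64_t offset, uint64_t len)
{
    while (len > 0) {
        uint32_t chunk = len > max_request ? max_request : (uint32_t)len;
        issue_read(offset, chunk);
        offset += chunk;
        len    -= chunk;
    }
}
```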

> The problem occurs if a read other than the first errors (or
> more accurately if any read errors after we have sent any data).
> How do we represent that error to the client? We've already
> returned that the operation has succeeded.
>
> To do proper error handling (which nbd-server doesn't, as far
> as I can tell), we'd need to save the whole read in memory,
> which is (a) memory inefficient, and (b) throughput inefficient
> as we'd have to buffer the entire read.
>
> One answer to this is "don't use large reads, then". However,
> in certain situations (e.g. servers that can parallelize
> requests), it's far more efficient to do larger reads.
> Even now, we wait until a large amount of data has been read
> before sending any.

Huh? A large read is better because the kernel can do better ordering of
the requests (in case the storage backend is fragmented) or do better
read ahead.

But if the server can parallelize requests, then a set of small reads
for the same chunk will all end up in kernel space in parallel and
result in basically the same performance (hopefully). Small reads will
cost some extra syscalls, but that should be negligible against the
disk speed.

So I wouldn't go so far as saying "far more efficient" without measuring
it first.

> Given that errors are really unlikely in the great scheme of
> things, a relatively low overhead solution to this would be
> to send the read followed by the error code (again) (we could
> signal this by returning "EDONTKNOW" or something in the
> original error field). If this was non-zero, the client would
> discard all the data and use this as the error code.
> This would waste 4 octets on every read reply where EDONTKNOW
> was used, which would solely be large read requests. Obviously
> as EDONTKNOW would be sent at the end of a large read, if there
> is an early error, we'd have to send a large amount of junk
> over TCP in the event of an error, but this is hardly
> a problem.

And fill the failed data with 0 bytes?

That scheme would mean the client would have to buffer the complete
read in memory and couldn't stream it to the application as it comes
in. As an extreme example, an NBD proxy/gateway could normally just
pipe the data through. But now it would have to buffer large requests.

Besides eating up huge amounts of memory on the client side, you get
the delay of having to wait for all the data again, just on the other
side of the socket.
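
To make that concrete, client-side handling of the proposal would look
something like this (a sketch only: EDONTKNOW is a made-up value and
recv_exact() stands in for reading from the socket):

```c
#include <stdint.h>
#include <string.h>

#define EDONTKNOW 0x7fffffffu   /* invented sentinel for illustration */

static const uint8_t *stream;   /* fake socket for the sketch */

static void recv_exact(void *dst, size_t n)
{
    memcpy(dst, stream, n);
    stream += n;
}

/* Returns 0 on success, or the error code the client must report.
 * With EDONTKNOW in the header, all the data must be received (and
 * buffered!) before the trailing error decides whether to keep it. */
static uint32_t read_reply(uint32_t hdr_error, uint32_t len, uint8_t *data)
{
    if (hdr_error != 0 && hdr_error != EDONTKNOW)
        return hdr_error;           /* classic early error: no data follows */
    recv_exact(data, len);          /* data always follows EDONTKNOW */
    if (hdr_error == EDONTKNOW) {
        uint32_t final_error;
        recv_exact(&final_error, sizeof final_error);
        if (final_error != 0)
            return final_error;     /* discard the buffered data */
    }
    return 0;
}

/* tiny demo: 4 data bytes followed by the trailing error word */
static uint32_t demo(uint32_t trailing)
{
    uint8_t wire[8], data[4];
    memset(wire, 0xcd, 4);
    memcpy(wire + 4, &trailing, sizeof trailing);
    stream = wire;
    return read_reply(EDONTKNOW, 4, data);
}
```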

> Whilst in theory we'd need to signal EDONTKNOW support, actually
> large reads ( > BUFSIZ ) are pretty dodgy in that any error
> will cause a disconnect. Paul suggests we never get them anyway
> due to kernel request size limitations, though Wouter seems
> skeptical. So I am tempted just to put EDONTKNOW support
> into nbd-server and the kernel without any signalling. There
> cannot be many people using large reads reliably as prior to the
> last release they were full of all sorts of, um, interesting
> features.

I would just return E2BIG as the error. If we don't hand out a maximum
request size beforehand, then let the kernel split up requests
dynamically when it gets an E2BIG error. Obviously the server shouldn't
disconnect on E2BIG.
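
A sketch of that dynamic splitting (server_limit and try_read are
invented for illustration; try_read stands in for issuing a request
and waiting for the reply):

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sketch: start with the full request and halve the
 * chunk size until the server stops answering E2BIG. */
static uint32_t server_limit = 64 * 1024;   /* unknown to the client */

static int try_read(uint64_t off, uint32_t len)
{
    (void)off;
    return len > server_limit ? -E2BIG : 0; /* fake server reply */
}

static int read_split(uint64_t off, uint64_t len, int *nreq)
{
    uint32_t chunk = (uint32_t)len;  /* assume len fits, for the sketch */
    while (len > 0) {
        uint32_t n = len < chunk ? (uint32_t)len : chunk;
        int r = try_read(off, n);
        if (r == -E2BIG) {
            chunk /= 2;              /* halve and retry */
            if (chunk == 0)
                return -EINVAL;
            continue;
        }
        if (r < 0)
            return r;
        (*nreq)++;
        off += n;
        len -= n;
    }
    return 0;
}
```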

MfG
        Goswin


