Re: [Nbd] Bug with large reads and protocol issue
- To: Goswin von Brederlow <goswin-v-b@...186...>, nbd-general@lists.sourceforge.net
- Subject: Re: [Nbd] Bug with large reads and protocol issue
- From: Alex Bligh <alex@...872...>
- Date: Fri, 03 Jun 2011 16:43:26 +0100
- Message-id: <1A3272F13ED09C17872AE433@...874...>
- Reply-to: Alex Bligh <alex@...872...>
- In-reply-to: <87wrh35gph.fsf@...860...>
- References: <9C86247514F89CE6A7B95C36@...908...> <87wrh35gph.fsf@...860...>
Goswin,
> Does that really make a difference past e.g. 128k chunks if pipelining
> is allowed? 1MB chunks? 10MB chunks? I think the idea of the server
> communicating a max request size to the client and sticking to that is
> the best here.
Sometimes, yes. Especially if your underlying block size is always 1MB or
greater (like mine is). Ignoring caching (which sometimes I have to do), I
have to do between 8 and 64 times as many high latency writes at 128K
block size.
On a normal disk: probably not a huge amount, but I'll bet you see some.
Certainly you would want your largest block size to exceed the size of the
writeback cache in the device. On a normal hard disk that could be a fair
amount larger than 128K. On a SAN, it's huge (I'm playing with one at the
moment with multigigabyte write caches).
IE if nbd-server is running on normal PC hardware with local disks,
you won't see much difference, which is probably why it is the
way it is at the moment.
>> One answer to this is "don't use large reads, then". However,
>> in certain situations (e.g. servers that can parallelize
>> requests), it's far more efficient to do larger reads.
>> Even now, we wait until a large amount of data has been read
>> before sending any.
> Huh? A large read is better because the kernel can do better ordering of
> the requests (in case the storage backend is fragmented) or do better
> read ahead.
I know that. I meant one answer is "don't use large reads" because
(as you point out) in many scenarios they don't buy you much.
> But if the server can parallelize requests then a set of small reads for
> the same chunk will all end up in kernel space in parallel, resulting in
> basically the same performance (hopefully). Small reads will cost some
> syscalls but that should be negligible against the disk speed.
This is not true if you are parallelising on a huge block basis, because
the Linux block layer knows nothing about the arrangement of the huge
underlying blocks. So the requests will tend to cross block boundaries.
> So I wouldn't go so far as saying "far more efficient" without measuring
> it first.
What makes you think I haven't measured it? :-) (not with nbd,
yet)
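To make the boundary-crossing point concrete, here is an illustrative sketch (Python, not nbd code) of checking whether a request spans an underlying-block boundary; with 1MB underlying blocks, a randomly placed 128K request straddles one roughly one time in eight:

```python
def crosses_boundary(offset, req_len, block_size):
    """True if [offset, offset + req_len) spans a boundary between
    underlying blocks of size block_size."""
    return offset // block_size != (offset + req_len - 1) // block_size

# With 1MB underlying blocks, a 128K request starting 64K before a
# boundary straddles it; one aligned to the block start does not.
MB, KB = 1 << 20, 1 << 10
assert crosses_boundary(MB - 64 * KB, 128 * KB, MB)
assert not crosses_boundary(0, 128 * KB, MB)
```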
>> Given that errors are really unlikely in the great scheme of
>> things, a relatively low overhead solution to this would be
>> to send the read followed by the error code (again) (we could
>> signal this by returning "EDONTKNOW" or something in the
>> original error field). If this was non-zero, the client would
>> discard all the data and use this as the error code.
>> This would waste 4 octets on every read reply where EDONTKNOW
>> was used, which would solely be large read requests. Obviously
>> as EDONTKNOW would be sent at the end of a large read, if there
>> is an early error, we'd have to send a large amount of junk
>> over TCP in the event of an error, but this is hardly
>> a problem.
> And fill the failed data with 0 bytes?
Fill the failed data with anything you like. It's a read, and
the read fails atomically, so all the data would be discarded.
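For concreteness, a client might handle the proposed trailing-error scheme roughly as in this sketch (illustrative Python; the value of EDONTKNOW is hypothetical and not part of the current protocol, though the 16-byte reply header layout and magic are the standard NBD ones):

```python
import struct

EDONTKNOW = 0x7FFFFFFF       # hypothetical "error comes later" marker
NBD_REPLY_MAGIC = 0x67446698  # standard NBD reply magic

def parse_read_reply(buf, length):
    """Parse one read reply under the proposed trailing-error scheme.

    buf: the full reply as received; length: the requested read size.
    Returns (error, data); on any error the whole read fails atomically,
    so data is None and whatever was received is discarded.
    """
    magic, error, handle = struct.unpack(">IIQ", buf[:16])
    assert magic == NBD_REPLY_MAGIC
    if error != 0 and error != EDONTKNOW:
        # Ordinary early failure: no data follows the header.
        return error, None
    data = buf[16:16 + length]
    if error == EDONTKNOW:
        # The real error code trails the data as 4 extra octets.
        (final_error,) = struct.unpack(">I", buf[16 + length:20 + length])
        if final_error != 0:
            return final_error, None  # discard the junk data
    return 0, data
```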
> That scheme would mean the client would have to buffer the complete read
> in memory and couldn't stream it to the application as it comes
> in. For an extreme example, an NBD proxy/gateway could normally just pipe
> the data through. But now it would have to buffer large requests.
The bio is already allocated isn't it? Doesn't it just copy it
into the bio as usual? IE it works exactly how large reads
currently work.
> Besides eating up huge amounts of memory on the client side you get the
> delay of having to wait for all the data again. Just on the other side
> of the socket.
I think you mean that the client can't complete parts of the request
as the data comes in, because there might be an eventual error.
This would be a problem except for the fact that you can't
do this anyway (you either complete a request or don't). The
atomic unit of a request is the whole request. And you can write
straight where you would write eventually (as indeed the client
does at the moment).
> I would just return E2BIG as error. If we don't hand out a maximum
> request size beforehand then let the kernel split up requests
> dynamically when it gets an E2BIG error. Obviously the server shouldn't
> disconnect on E2BIG.
See above for why I don't want to do that. That's also not what nbd
does at the moment: it handles large blocks (except it goes mad on an
error).
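For reference, the client-side splitting Goswin describes might look like this sketch (a hypothetical helper, not existing nbd or kernel code; `do_read` stands in for whatever issues a single NBD read request):

```python
import errno

def read_with_split(do_read, offset, length, min_chunk=4096):
    """Read [offset, offset + length), halving the request and retrying
    whenever the server answers E2BIG (Goswin's proposal, sketched).

    do_read(offset, length) returns the data, or raises OSError with
    errno.E2BIG when the server considers the request too large.
    """
    try:
        return do_read(offset, length)
    except OSError as e:
        if e.errno != errno.E2BIG or length <= min_chunk:
            raise  # a genuine error, or already at the minimum size
        half = length // 2
        return (read_with_split(do_read, offset, half, min_chunk) +
                read_with_split(do_read, offset + half, length - half,
                                min_chunk))
```

Negotiating a maximum request size up front would avoid the extra round trips this retry loop implies, which is part of why the discussion above favours that approach.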
--
Alex Bligh