
Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect



Hi Mike,

On Thu, Aug 31, 2006 at 06:02:28PM -0400, Mike Snitzer wrote:
> Haven't tried that yet but here is a backtrace of where the child nbd-server
> is hung when this issue hits the fan:
> 
> (gdb) bt
> #0  0x0000003e504b8a82 in __read_nocancel () from /lib64/tls/libc.so.6
> #1  0x0000000000401ffd in readit (f=4, buf=0x7fffffc01440, len=28) at nbd-server.c:226
> #2  0x0000000000402ea2 in mainloop (client=0x506a40) at nbd-server.c:668
> #3  0x0000000000403533 in serveconnection (client=0x506a40) at nbd-server.c:824
> #4  0x0000000000403c1d in serveloop (serve=0x506010) at nbd-server.c:1034
> #5  0x0000000000403cd2 in main (argc=3, argv=0x7fffffc01698) at nbd-server.c:1077
> 
> So this says to me that the nbd-client died suddenly while the child
> nbd-server was performing a read and the nbd-server readit() is wedged.

I had some inspiration recently regarding this Heisenbug.

The issue is that the listening server socket has O_NONBLOCK set, but
we don't set it again on the sockets returned by accept() on that
socket. As a result, a read on an accepted socket will block if the
client is forcibly disconnected, which is also what we see in the above
backtrace. I knew that, and I fully expected the child nbd-server to
hang when this happened; but since the parent nbd-server has a
different socket, which is not connected to a client, I didn't think
this was related in any way. However, it occurred to me that this is
not necessarily true: both sockets are somewhat related to each other
(if only because they share the same port), so I can imagine there
being some lock when blocking IO happens on one of the sockets but not
the other.
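Whether an accepted socket picks up O_NONBLOCK from the listening
socket depends on the platform: Linux's accept(2) documents that the
new socket does not inherit file status flags such as O_NONBLOCK,
while BSD-derived systems traditionally do. A small diagnostic can
show which behaviour a given box has. This is a minimal sketch assuming
a POSIX system with a loopback interface; the function name
accepted_socket_is_nonblocking is hypothetical and not part of
nbd-server.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical diagnostic, not part of nbd-server: returns 1 if the socket
 * returned by accept() on a non-blocking listener has O_NONBLOCK set,
 * 0 if it does not, and -1 if any setup step fails. */
int accepted_socket_is_nonblocking(void)
{
    int result = -1;
    int lfd = -1, cfd = -1, afd = -1;
    int flags;
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct pollfd pfd;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel choose a free port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* Make the listening socket non-blocking, as the parent server does. */
    flags = fcntl(lfd, F_GETFL);
    if (flags < 0 || fcntl(lfd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto out;

    /* Connect a client so there is a connection waiting to be accepted. */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Wait (up to 1 s) until the listener reports a pending connection. */
    pfd.fd = lfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 1000) != 1)
        goto out;

    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto out;

    /* Inspect the file status flags of the newly accepted socket. */
    flags = fcntl(afd, F_GETFL);
    if (flags >= 0)
        result = (flags & O_NONBLOCK) ? 1 : 0;

out:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return result;
}
```

On Linux this should return 0; where accept() inherits the flag it
would return 1. Setting the flag explicitly on the accepted socket
gives the same behaviour on both.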

For that reason, I just committed a change to the trunk NBD server that
makes the child socket non-blocking as well. I would appreciate it if
you could test this change, since I still cannot reliably reproduce the
problem (I have been able to on a few occasions, but somehow it doesn't
happen to me every time). I also created a "nightly" tarball; you'll
find it at http://nbd.sourceforge.net/nbd-SVN.tar.gz (and .bz2). Note
that this is a 2.9.2 prerelease, not a 2.8.8.
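Making the child socket non-blocking amounts to adding O_NONBLOCK to
the socket's file status flags right after accept() returns and before
the socket is handed to the child. The sketch below is an illustration
under that assumption, not the code actually committed to trunk; the
names set_nonblocking and read_would_block_demo are hypothetical, and a
socketpair() stands in for the accepted client connection.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to fd while preserving its other
 * file status flags. Returns 0 on success, -1 on error (errno set). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Demonstration: with no data pending, read() on a non-blocking socket
 * fails immediately with EAGAIN/EWOULDBLOCK instead of sleeping.
 * Returns 1 if read() did not block, 0 otherwise. */
int read_would_block_demo(void)
{
    int sv[2];
    char buf[28]; /* same length as the readit() call in the backtrace */
    ssize_t n;
    int ok = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 0;
    if (set_nonblocking(sv[0]) == 0) {
        n = read(sv[0], buf, sizeof(buf));
        ok = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    }
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

A server that switches to non-blocking sockets also has to cope with
EAGAIN, for example by waiting in select() or poll() before retrying
the read; otherwise a loop that treats every -1 from read() as fatal
would abort on a connection that is merely idle. That part is not
shown here.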

Occasionally, this change also seems to improve performance slightly,
although I'm not entirely sure about that bit.

If this fixes it, I'll commit the same change to the 2.8 branch and
make a release there, too. Here's hoping.

-- 
<Lo-lan-do> Home is where you have to wash the dishes.
  -- #debian-devel, Freenode, 2004-09-22


