
Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect

On 8/31/06, Mike Snitzer <snitzer@...17...> wrote:

 I've not yet put my finger on what is locking the mutex, but the parent
nbd-server is trying to lock this mutex when the nbd-servers get hung.
Some new information is that both the parent and the child nbd-server
(serving one nbd-client connection) instances get hung when the nbd-client
gets forcibly killed (reboot -nf on the nbd-client system).

Please disregard the noise (gdb backtrace et al) I sent yesterday
related to the child nbd-server blocking in __read_nocancel().  This
is simply because the child nbd-server's socket used for reading from
the nbd-client is a blocking socket (NOTE: parent nbd-server's
non-blocking socket becomes a blocking socket for the child nbd-server
once the nbd-client connection is accept()'d).
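
To illustrate the blocking vs. non-blocking point, here is a minimal
Python sketch (not nbd-server code; the function name is made up for
illustration).  It assumes Linux, where accept(2) does not pass
O_NONBLOCK from the listening socket on to the accepted socket; other
platforms may behave differently.

```python
import os
import select
import socket


def listener_and_accepted_blocking_modes():
    """Return (listener_is_blocking, accepted_is_blocking) as the OS reports them."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    conn = None
    try:
        listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        listener.listen(1)
        listener.setblocking(False)       # like a select()-driven parent server

        # Loopback connect completes once the kernel queues the connection.
        client.connect(listener.getsockname())

        # Wait until the pending connection makes the listener readable.
        readable, _, _ = select.select([listener], [], [], 5.0)
        if not readable:
            raise RuntimeError("no connection became ready within 5 seconds")
        conn, _ = listener.accept()

        # Ask the kernel whether O_NONBLOCK is clear on each descriptor.
        return os.get_blocking(listener.fileno()), os.get_blocking(conn.fileno())
    finally:
        for s in (conn, client, listener):
            if s is not None:
                s.close()


if __name__ == "__main__":
    print(listener_and_accepted_blocking_modes())
```

On Linux the first value should be False and the second True, which is
why a read() in the child simply sits there until data arrives.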

When the nbd-client side socket that is connected to the child
nbd-server's socket is forcibly killed (e.g. hardware or power
failure; or in my case a forced reboot -nf) any outstanding read over
the child nbd-server's socket will block (for ~2 hours).  You can use
the nbd-server's --idle-time argument to have the hung child
nbd-server exit.  After quite a bit of testing the existence (or lack
thereof) of the hung child nbd-server seems to have no bearing on the
ability of the parent nbd-server to accommodate new nbd-client
connections.
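
As a rough illustration of the idle-timeout idea (this is only a
sketch, not how nbd-server implements --idle-time; the helper name and
its parameters are hypothetical), a blocking read can be bounded so
that a peer which vanished without closing its socket does not hold
the reader forever.  From the reader's side such a peer looks the same
as one that simply stays silent.

```python
import socket


def read_with_idle_timeout(sock, nbytes, idle_seconds):
    """Read up to nbytes from sock.

    Returns the bytes received, b"" if the peer closed the connection
    cleanly, or None if nothing arrived within idle_seconds.
    """
    sock.settimeout(idle_seconds)      # bound the otherwise blocking recv()
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None                    # peer silent: caller may decide to exit
    finally:
        sock.settimeout(None)          # restore plain blocking mode


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    # The "client" never sends anything, like a machine that was reset.
    if read_with_idle_timeout(server_side, 4096, 0.5) is None:
        print("idle timeout reached, giving up on this connection")
    server_side.close()
    client_side.close()
```

A child server built this way could exit when the helper returns None
instead of waiting for the kernel to notice the dead connection.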

So taking a step back.  It is clear that the major bug with the
nbd-server is that the parent nbd-server hangs in
__lll_mutex_lock_wait() after the next nbd-client connection attempt
following the nbd-client (nbd kernel sockets) having been forcibly
killed.  All attempts to interrogate the nbd-server (gdb or strace)
will mask this issue.

> Did you see this behaviour with previous versions of the server? If not,
> I know where to look...

So I'm not sure whether your hunch on where to look was my select code, but...

FYI, I've verified this problem with both 2.8.5 (accept-based) and
2.8.6 (select-based).  It is clear that the nbd-server _seems_
perfectly fine (either blocked in accept or select) waiting for an
nbd-client connection.  When the nbd-client connection is made without
gdb or strace attached the nbd-server fails the connection and then
promptly wedges itself trying to get a mutex.

So the big question for me right now is: why does the abrupt
disappearance of the remote nbd-client socket connection to the child
nbd-server's socket have any bearing on the health/ability of the
parent nbd-server to accept and negotiate a new nbd-client connection?
Clearly some transient state isn't getting cleaned up on the
nbd-server when an nbd-client connection drops abruptly.

BTW, there really isn't an issue with the IO load needing to be
excessively high in order to reproduce this; simply copying a linux
source tree over nbd would suffice.

I'm not any closer to understanding where/why/what is causing
__lll_mutex_lock_wait to even be called within the nbd-server.  The
nbd-server is hung waiting for this mutex but the gdb backtrace is
truncated/useless like I showed earlier in this thread.  So is there
just some weird corruption occurring?

I can reliably reproduce this issue and welcome any suggestions; I'll
try to get traces of the nbd-client when it fails and also compile
nbd-server with -DDODBG.

I'll report back if I find anything but any assistance would be appreciated.

