Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect

On 8/29/06, Wouter Verhelst <wouter@...3...> wrote:

On Tue, Aug 29, 2006 at 12:37:29PM -0400, Mike Snitzer wrote:
> With nbd-2.8.6, I'm seeing a strange situation where the machine
> running the nbd-client gets rebooted and the nbd-client tries to
> reconnect before the nbd-server acknowledges the fact that the
> original nbd-client was disconnected.  The second nbd-client
> connection comes in to the parent nbd-server and then hangs when
> fork()ing the child nbd-server to handle the new connection.
[...]
>
> Aug 28 18:51:31 host1 nbd_server[6422]: connect from 192.168.14.30,
> assigned file is /dev/sdb
> Aug 28 18:51:31 host1 nbd_server[6422]: Can't open authorization file
> (null) (Bad address).
> Aug 28 18:51:31 host1 nbd_server[6422]: Authorized client
> Aug 28 18:51:31 host1 nbd_server[6424]: Starting to serve
> Aug 28 18:51:31 host1 nbd_server[6424]: size of exported file/device
> is 399988752384
>
> <nbd-client machine was rebooted, notice that the subsequent
> nbd-client start (after host1 reboot) does NOT succeed, nbd-server
> fork() fails as there is no ""Starting to serve", etc>
>
> Aug 28 19:56:02 host1 nbd_server[6422]: connect from 192.168.14.30,
> assigned file is /dev/sdb
> Aug 28 19:56:02 host1 nbd_server[6422]: Can't open authorization file
> (null) (Bad address).
> Aug 28 19:56:02 host1 nbd_server[6422]: Authorized client
>
> <original nbd-server child process associated with the first
> nbd-client finally exits; BUT the nbd-server doesn't waitpid because
> its wedged; resulting in a zombie nbd-server>
>
> Aug 28 21:53:38 host1 nbd_server[6424]: Read failed: Connection reset by peer
>
> Unfortunately I don't have a trace of the nbd-server to know where the
> following serverloop() code fails:
>
>                         pid=g_malloc(sizeof(pid_t));
> #ifndef NOFORK
>                         if ((*pid=fork())<0) {
>                                 msg3(LOG_INFO,"Could not fork
> (%s)",strerror(errno)) ;
>                                 close(net) ;
>                                 continue ;
>                         }
>                         if (*pid>0) { /* parent */
>                                 close(net);
>                                 g_hash_table_insert(children, pid, pid);
>                                 continue;
>                         }
>                         /* child */
>                         g_hash_table_destroy(children);
>                         close(serve->socket) ;
> #endif // NOFORK
>
> Connecting to the parent nbd-server with gbd only yields the insight
> that it is wedged in libc waiting for a mutex:
>
> [root@...97... ~]# gdb /usr/local/bin/nbd-server 6422
> GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
> ...
> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db
> library "/lib64/tls/libthread_db.so.1".
>
> Attaching to program: /usr/local/bin/nbd-server, process 6422
> Reading symbols from /usr/lib64/libglib-2.0.so.0...done.
> Loaded symbols for /usr/lib64/libglib-2.0.so.0
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
> #1  0x0000003e5062d500 in __GI__IO_wfile_jumps () from /lib64/tls/libc.so.6
> #2  0x0000000000000000 in ?? ()
>
> Is the mutex buried somewhere within a glib call (g_malloc or
> g_hash_table_insert)?

It would almost have to; I'm not calling it directly myself.

I've not yet put my finger on what is locking mutex; but this mutex is trying to be locked in the parent nbd-server when the nbd-SERVERS get hung. Some new information is that both the parent and the child nbd-server (serving one nbd-client connection) instances get hung when the nbd-client gets forcibly killed (reboot -nf on the nbd-client system).

Unfortunately the backtrace doesn't help much.

> Needless to say this nbd-server hang is _very_ bad and I'd appreciate
> any insight that might help track it down. Please let me know if you
> need any additional info or if you'd like me to try anything.

Can you reproduce it? If so, please try running it in 'strace -o foo -ff
nbd-server <normal arguments...>', which will produce a bunch of files,
one for each forked-off process, with strace info for that particular
process; it might provide insight.

Reproducing this is actually quite easy in my environment; just have heavy IO (e.g. MD resync) through the nbd-client and then reboot -nf that machine. As for strace'ing the nbd-server process; that was the first thing that was tried... unfortunately it slows the nbd-server down enough that this issue can't be reproduced.

Have you tried building an nbd-server with -DDODBG in CFLAGS?

Haven't tried that yet but here is a backtrace of where the child nbd-server is hung when this issue hits the fan:

(gdb) bt
#0 0x0000003e504b8a82 in __read_nocancel () from /lib64/tls/libc.so.6
#1 0x0000000000401ffd in readit (f=4, buf=0x7fffffc01440, len=28) at nbd-server.c:226
#2 0x0000000000402ea2 in mainloop (client=0x506a40) at nbd-server.c:668
#3 0x0000000000403533 in serveconnection (client=0x506a40) at nbd-server.c:824
#4 0x0000000000403c1d in serveloop (serve=0x506010) at nbd-server.c:1034
#5 0x0000000000403cd2 in main (argc=3, argv=0x7fffffc01698) at nbd-server.c:1077

So this says to me that the nbd-client died suddenly while the child nbd-server was performing a read and the nbd-server readit() is wedged.

Did you see this behaviour with previous versions of the server? If not,
I know where to look...

Haven't seen this issue with other versions BUT we're getting hangs in both the child and parent nbd-servers. SO I'm not sure if your gut on where to look was my select code but...

please advise, thanks.
Mike