Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect

To: Mike Snitzer <snitzer@...17...>
Cc: nbd-general@lists.sourceforge.net
Subject: Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect
From: Wouter Verhelst <wouter@...3...>
Date: Wed, 30 Aug 2006 01:15:48 +0200
Message-id: <20060829231548.GA17396@...39...>
In-reply-to: <170fa0d20608290937h59a6654fn320c2c0a2820ed5e@...18...>
References: <170fa0d20608290937h59a6654fn320c2c0a2820ed5e@...18...>

On Tue, Aug 29, 2006 at 12:37:29PM -0400, Mike Snitzer wrote:
> With nbd-2.8.6, I'm seeing a strange situation where the machine
> running the nbd-client gets rebooted and the nbd-client tries to
> reconnect before the nbd-server acknowledges the fact that the
> original nbd-client was disconnected.  The second nbd-client
> connection comes in to the parent nbd-server and then hangs when
> fork()ing the child nbd-server to handle the new connection.
[...]
> 
> Aug 28 18:51:31 host1 nbd_server[6422]: connect from 192.168.14.30,
> assigned file is /dev/sdb
> Aug 28 18:51:31 host1 nbd_server[6422]: Can't open authorization file
> (null) (Bad address).
> Aug 28 18:51:31 host1 nbd_server[6422]: Authorized client
> Aug 28 18:51:31 host1 nbd_server[6424]: Starting to serve
> Aug 28 18:51:31 host1 nbd_server[6424]: size of exported file/device
> is 399988752384
> 
> <nbd-client machine was rebooted, notice that the subsequent
> nbd-client start (after host1 reboot) does NOT succeed, nbd-server
> fork() fails as there is no "Starting to serve", etc>
> 
> Aug 28 19:56:02 host1 nbd_server[6422]: connect from 192.168.14.30,
> assigned file is /dev/sdb
> Aug 28 19:56:02 host1 nbd_server[6422]: Can't open authorization file
> (null) (Bad address).
> Aug 28 19:56:02 host1 nbd_server[6422]: Authorized client
> 
> <original nbd-server child process associated with the first
> nbd-client finally exits; BUT the nbd-server doesn't waitpid because
> it's wedged; resulting in a zombie nbd-server>
> 
> Aug 28 21:53:38 host1 nbd_server[6424]: Read failed: Connection reset by peer
> 
> Unfortunately I don't have a trace of the nbd-server to know where the
> following serverloop() code fails:
> 
>                         pid=g_malloc(sizeof(pid_t));
> #ifndef NOFORK
>                         if ((*pid=fork())<0) {
>                                 msg3(LOG_INFO,"Could not fork (%s)",strerror(errno)) ;
>                                 close(net) ;
>                                 continue ;
>                         }
>                         if (*pid>0) { /* parent */
>                                 close(net);
>                                 g_hash_table_insert(children, pid, pid);
>                                 continue;
>                         }
>                         /* child */
>                         g_hash_table_destroy(children);
>                         close(serve->socket) ;
> #endif // NOFORK
> 
> Connecting to the parent nbd-server with gdb only yields the insight
> that it is wedged in libc waiting for a mutex:
> 
> [root@...97... ~]# gdb /usr/local/bin/nbd-server 6422
> GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
> ...
> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db
> library "/lib64/tls/libthread_db.so.1".
> 
> Attaching to program: /usr/local/bin/nbd-server, process 6422
> Reading symbols from /usr/lib64/libglib-2.0.so.0...done.
> Loaded symbols for /usr/lib64/libglib-2.0.so.0
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
> #1  0x0000003e5062d500 in __GI__IO_wfile_jumps () from /lib64/tls/libc.so.6
> #2  0x0000000000000000 in ?? ()
> 
> Is the mutex buried somewhere within a glib call (g_malloc or
> g_hash_table_insert)?

It would almost have to be; I'm not taking any mutex directly myself.

Unfortunately the backtrace doesn't help much.

> Needless to say this nbd-server hang is _very_ bad and I'd appreciate
> any insight that might help track it down.  Please let me know if you
> need any additional info or if you'd like me to try anything.

Can you reproduce it? If so, please run the server under 'strace -o foo
-ff nbd-server <normal arguments...>'. That writes one file per process
(foo.<pid>, including one for each forked-off child) with the strace
output for that particular process; it might provide insight.

Have you tried building an nbd-server with -DDODBG in CFLAGS?

Did you see this behaviour with previous versions of the server? If not,
I know where to look...

-- 
<Lo-lan-do> Home is where you have to wash the dishes.
  -- #debian-devel, Freenode, 2004-09-22
