Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect
- To: Mike Snitzer <snitzer@...17...>
- Cc: nbd-general@lists.sourceforge.net
- Subject: Re: [Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect
- From: Wouter Verhelst <wouter@...3...>
- Date: Wed, 30 Aug 2006 01:15:48 +0200
- Message-id: <20060829231548.GA17396@...39...>
- In-reply-to: <170fa0d20608290937h59a6654fn320c2c0a2820ed5e@...18...>
- References: <170fa0d20608290937h59a6654fn320c2c0a2820ed5e@...18...>
On Tue, Aug 29, 2006 at 12:37:29PM -0400, Mike Snitzer wrote:
> With nbd-2.8.6, I'm seeing a strange situation where the machine
> running the nbd-client gets rebooted and the nbd-client tries to
> reconnect before the nbd-server acknowledges the fact that the
> original nbd-client was disconnected. The second nbd-client
> connection comes in to the parent nbd-server and then hangs when
> fork()ing the child nbd-server to handle the new connection.
[...]
>
> Aug 28 18:51:31 host1 nbd_server[6422]: connect from 192.168.14.30,
> assigned file is /dev/sdb
> Aug 28 18:51:31 host1 nbd_server[6422]: Can't open authorization file
> (null) (Bad address).
> Aug 28 18:51:31 host1 nbd_server[6422]: Authorized client
> Aug 28 18:51:31 host1 nbd_server[6424]: Starting to serve
> Aug 28 18:51:31 host1 nbd_server[6424]: size of exported file/device
> is 399988752384
>
> <nbd-client machine was rebooted, notice that the subsequent
> nbd-client start (after host1 reboot) does NOT succeed, nbd-server
> fork() fails as there is no "Starting to serve", etc>
>
> Aug 28 19:56:02 host1 nbd_server[6422]: connect from 192.168.14.30,
> assigned file is /dev/sdb
> Aug 28 19:56:02 host1 nbd_server[6422]: Can't open authorization file
> (null) (Bad address).
> Aug 28 19:56:02 host1 nbd_server[6422]: Authorized client
>
> <original nbd-server child process associated with the first
> nbd-client finally exits; BUT the nbd-server doesn't waitpid because
> it's wedged; resulting in a zombie nbd-server>
>
> Aug 28 21:53:38 host1 nbd_server[6424]: Read failed: Connection reset by peer
>
> Unfortunately I don't have a trace of the nbd-server to know where the
> following serverloop() code fails:
>
> pid=g_malloc(sizeof(pid_t));
> #ifndef NOFORK
> if ((*pid=fork())<0) {
> msg3(LOG_INFO,"Could not fork (%s)",strerror(errno)) ;
> close(net) ;
> continue ;
> }
> if (*pid>0) { /* parent */
> close(net);
> g_hash_table_insert(children, pid, pid);
> continue;
> }
> /* child */
> g_hash_table_destroy(children);
> close(serve->socket) ;
> #endif // NOFORK
>
> Connecting to the parent nbd-server with gdb only yields the insight
> that it is wedged in libc waiting for a mutex:
>
> [root@...97... ~]# gdb /usr/local/bin/nbd-server 6422
> GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
> ...
> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db
> library "/lib64/tls/libthread_db.so.1".
>
> Attaching to program: /usr/local/bin/nbd-server, process 6422
> Reading symbols from /usr/lib64/libglib-2.0.so.0...done.
> Loaded symbols for /usr/lib64/libglib-2.0.so.0
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
> #1 0x0000003e5062d500 in __GI__IO_wfile_jumps () from /lib64/tls/libc.so.6
> #2 0x0000000000000000 in ?? ()
>
> Is the mutex buried somewhere within a glib call (g_malloc or
> g_hash_table_insert)?
It would almost have to be; I'm not taking any mutex directly myself.
Unfortunately the backtrace doesn't help much.
> Needless to say this nbd-server hang is _very_ bad and I'd appreciate
> any insight that might help track it down. Please let me know if you
> need any additional info or if you'd like me to try anything.
Can you reproduce it? If so, please try running it in 'strace -o foo -ff
nbd-server <normal arguments...>', which will produce a bunch of files,
one for each forked-off process, with strace info for that particular
process; it might provide insight.
Have you tried building an nbd-server with -DDODBG in CFLAGS?
Did you see this behaviour with previous versions of the server? If not,
I know where to look...
--
<Lo-lan-do> Home is where you have to wash the dishes.
-- #debian-devel, Freenode, 2004-09-22