
[Nbd] nbd-server 2.8.6 hangs on nbd-client reconnect



With nbd-2.8.6, I'm seeing a strange situation where the machine
running the nbd-client gets rebooted and the nbd-client tries to
reconnect before the nbd-server has noticed that the original
nbd-client went away.  The second nbd-client connection reaches the
parent nbd-server, which then hangs while fork()ing the child
nbd-server that should handle the new connection.

Aug 28 18:51:31 host1 nbd_server[6422]: connect from 192.168.14.30,
assigned file is /dev/sdb
Aug 28 18:51:31 host1 nbd_server[6422]: Can't open authorization file
(null) (Bad address).
Aug 28 18:51:31 host1 nbd_server[6422]: Authorized client
Aug 28 18:51:31 host1 nbd_server[6424]: Starting to serve
Aug 28 18:51:31 host1 nbd_server[6424]: size of exported file/device
is 399988752384

<the nbd-client machine was rebooted; note that the subsequent
nbd-client start (after the client machine's reboot) does NOT succeed:
the nbd-server never completes the fork(), as there is no "Starting to
serve", etc.>

Aug 28 19:56:02 host1 nbd_server[6422]: connect from 192.168.14.30,
assigned file is /dev/sdb
Aug 28 19:56:02 host1 nbd_server[6422]: Can't open authorization file
(null) (Bad address).
Aug 28 19:56:02 host1 nbd_server[6422]: Authorized client

<the original nbd-server child process associated with the first
nbd-client finally exits, BUT the parent nbd-server doesn't waitpid()
it because it's wedged, resulting in a zombie nbd-server>

Aug 28 21:53:38 host1 nbd_server[6424]: Read failed: Connection reset by peer


Unfortunately I don't have a trace of the nbd-server, so I can't tell
where in the following serverloop() code it gets stuck:

        pid=g_malloc(sizeof(pid_t));
#ifndef NOFORK
        if ((*pid=fork())<0) {
                msg3(LOG_INFO,"Could not fork (%s)",strerror(errno)) ;
                close(net) ;
                continue ;
        }
        if (*pid>0) { /* parent */
                close(net);
                g_hash_table_insert(children, pid, pid);
                continue;
        }
        /* child */
        g_hash_table_destroy(children);
        close(serve->socket) ;
#endif // NOFORK
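
If it would help, I can rebuild with some crude tracing around each
call in that path to pin down which one wedges.  Something along these
lines (the syslog() calls are mine and purely illustrative, assuming
<syslog.h> is already pulled in since LOG_INFO is used there; I'd
reuse whatever logging the server already does):

        /* proposed tracing only -- not part of nbd-server */
        syslog(LOG_INFO, "serverloop: before g_malloc");
        pid=g_malloc(sizeof(pid_t));
        syslog(LOG_INFO, "serverloop: before fork");
#ifndef NOFORK
        if ((*pid=fork())<0) {
                msg3(LOG_INFO,"Could not fork (%s)",strerror(errno)) ;
                close(net) ;
                continue ;
        }
        if (*pid>0) { /* parent */
                close(net);
                syslog(LOG_INFO, "serverloop: before g_hash_table_insert");
                g_hash_table_insert(children, pid, pid);
                syslog(LOG_INFO, "serverloop: child %d registered", (int)*pid);
                continue;
        }
        /* ... rest of the block unchanged ... */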

Connecting to the parent nbd-server with gdb only yields the insight
that it is wedged in libc waiting for a mutex:

[root@...97... ~]# gdb /usr/local/bin/nbd-server 6422
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
...
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db
library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/local/bin/nbd-server, process 6422
Reading symbols from /usr/lib64/libglib-2.0.so.0...done.
Loaded symbols for /usr/lib64/libglib-2.0.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003e504d16ab in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6
#1  0x0000003e5062d500 in __GI__IO_wfile_jumps () from /lib64/tls/libc.so.6
#2  0x0000000000000000 in ?? ()

Is the mutex buried somewhere within a glib call (g_malloc or
g_hash_table_insert)?
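
My (completely unverified) guess is the classic allocator-lock
deadlock: if a SIGCHLD handler that reaps children ends up calling
back into the allocator, directly or through glib, while the main loop
is sitting inside g_malloc() or g_hash_table_insert(), the handler
blocks on the malloc mutex that the interrupted code already holds and
the parent wedges for good.  Purely to illustrate the pattern I mean
(this is NOT nbd-server code, every name below is made up):

        #include <signal.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/wait.h>

        /* Reap children from a signal handler.  free()/malloc() are not
         * async-signal-safe: if this handler interrupts a malloc() in the
         * main loop, it blocks forever on the allocator mutex that the
         * interrupted malloc() still holds. */
        static void reap_children(int sig)
        {
                (void)sig;
                while (waitpid(-1, NULL, WNOHANG) > 0)
                        free(malloc(64));  /* stand-in for any handler code
                                              that touches the allocator */
        }

        int main(void)
        {
                signal(SIGCHLD, reap_children);
                for (;;) {
                        void *p = malloc(4096); /* SIGCHLD may land mid-malloc */
                        if (fork() == 0)
                                _exit(0);       /* child exits, raising SIGCHLD */
                        free(p);
                }
        }

If something like that is what's happening here, it would explain a
parent stuck in __lll_mutex_lock_wait exactly as in the backtrace
above, and blocking SIGCHLD around the allocator calls (or deferring
the reaping to the main loop) ought to make the hang disappear.  But
that's only a guess on my part.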

Needless to say, this nbd-server hang is _very_ bad, and I'd
appreciate any insight that might help track it down.  Please let me
know if you need any additional info or if you'd like me to try
anything.

thanks,
Mike


