
[Nbd] Fwd: nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while sock_xmit() is working on a receive



FYI, I used the wrong mailing list address in my original mail.


---------- Forwarded message ----------
From: Mike Snitzer <snitzer@...17...>
Date: Wed, Mar 26, 2008 at 2:43 PM
Subject: nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while
sock_xmit() is working on a receive
To: Paul Clements <paul.clements@...124...>
Cc: nbd-general-request@lists.sourceforge.net, linux-kernel@...25...


I'm seeing that nbd_device's socket is getting set to NULL in the
 middle of nbd_read_stat()'s sock_xmit().

 There appears to be a race when 'nbd-client -d' asks an NBD device to
 first disconnect from the nbd-server (via the NBD_DISCONNECT ioctl)
 and then clear the device's socket, setting it to NULL (via the
 NBD_CLEAR_SOCK ioctl).

 Both NBD_DISCONNECT and NBD_CLEAR_SOCK take the nbd_device's tx_lock
 (which protects the socket during transmits), _but_ for receives the
 socket can be set to NULL (via NBD_CLEAR_SOCK) at any time while
 inside sock_xmit(); as such, NBD_CLEAR_SOCK can cause a NULL pointer
 dereference in sock_xmit().

 Analyzing the crash, it is clear that the NULL pointer dereference
 occurs when sock_xmit()'s do {} while() loop dereferences the
 nbd_device's socket with:
 sock->sk->sk_allocation = GFP_NOIO;
 I also saw that the sock_xmit() caller is nbd_read_stat().
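
 For illustration only, here is a minimal user-space sketch of that
 dereference together with a defensive NULL check. The model_* names
 are hypothetical stand-ins, not the kernel's structures, and
 MODEL_GFP_NOIO is a placeholder value; the real sock_xmit() is not
 reproduced in this mail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct sock / struct socket. */
struct model_sk { int sk_allocation; };
struct model_socket { struct model_sk *sk; };
struct model_nbd_device { struct model_socket *sock; };

#define MODEL_GFP_NOIO 1 /* placeholder, not the kernel's GFP_NOIO value */

/*
 * Snapshot the device's socket pointer once and refuse to continue if it
 * has already been cleared, instead of dereferencing NULL.
 */
int model_sock_xmit(struct model_nbd_device *dev)
{
    struct model_socket *sock;

    assert(dev != NULL);
    sock = dev->sock;

    if (sock == NULL)
        return -EINVAL; /* socket already cleared by a CLEAR_SOCK-like path */

    sock->sk->sk_allocation = MODEL_GFP_NOIO;
    return 0;
}
```

 A check like this only avoids dereferencing a socket that is already
 NULL on entry; by itself it does not stop the socket from being torn
 down later in the call.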

 The sequence looks like this:

 nbd1: NBD_DISCONNECT
 [NOTE: a sock_xmit() send attempt is made on behalf of NBD_DISCONNECT]
 nbd1: Send control failed (result -32)
 ...
 [NBD is still dequeueing requests]
 ...
 Race: [NBD_CLEAR_SOCK ioctl][FATAL: nbd_read_stat()'s sock_xmit()
 receive attempt causes NULL pointer]

 In practice this looks like:

 nbd1: NBD_DISCONNECT
 nbd1: Send control failed (result -32)
 end_request: I/O error, dev nbd1, sector 0
 end_request: I/O error, dev nbd1, sector 8032264
 md: super_written gets error=-5, uptodate=0
 raid1: Disk failure on nbd1, disabling device.
        Operation continuing on 1 devices
 Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
  [<ffffffff88b1e125>] :nbd:sock_xmit+0x9d/0x301

 The fact that sock_xmit() in receive mode is unprotected seems to
 explain WHY a NULL pointer dereference is possible, but I'm still
 trying to identify exactly HOW it happens.

 But for me this raises the question: why isn't the nbd_device's socket
 always protected during sock_xmit() for both transmits and receives,
 rather than just for transmits (via tx_lock)?
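
 To make the question concrete, below is a rough user-space sketch
 (pthreads, not kernel code) in which one lock covers the send path,
 the receive path, and the clear path alike. All model_* names and the
 sock_lock field are hypothetical. Note the obvious cost: holding a
 lock across a blocking receive would make the clear path wait, so this
 illustrates the idea and is not a proposed patch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct model_socket { int xmit_count; };

struct model_nbd_device {
    pthread_mutex_t sock_lock;   /* hypothetical: guards sock for tx AND rx */
    struct model_socket *sock;
};

/* Shared helper: both directions take the same lock before touching sock. */
static int model_xmit_locked(struct model_nbd_device *dev)
{
    int ret = -EINVAL;

    assert(dev != NULL);
    pthread_mutex_lock(&dev->sock_lock);
    if (dev->sock != NULL) {
        dev->sock->xmit_count++; /* stand-in for the real send/receive work */
        ret = 0;
    }
    pthread_mutex_unlock(&dev->sock_lock);
    return ret;
}

int model_send(struct model_nbd_device *dev) { return model_xmit_locked(dev); }
int model_recv(struct model_nbd_device *dev) { return model_xmit_locked(dev); }

/* Analogue of a CLEAR_SOCK-style path: waits for any transfer holding the lock. */
void model_clear_sock(struct model_nbd_device *dev)
{
    assert(dev != NULL);
    pthread_mutex_lock(&dev->sock_lock);
    dev->sock = NULL;
    pthread_mutex_unlock(&dev->sock_lock);
}
```

 With this arrangement a receive that starts after the clear sees NULL
 and returns -EINVAL, and a clear that starts during a receive has to
 wait for it to finish.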

 Any help on the "right" fix would be appreciated, thanks.
 Mike


