Re: [Nbd] nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while sock_xmit() is working on a receive

On Thu, Mar 27, 2008 at 8:35 AM, Paul Clements
<paul.clements@...124...> wrote:
> Mike Snitzer wrote:
>  > In practice this looks like:
>  >
>  > nbd1: Send control failed (result -32)
>  > end_request: I/O error, dev nbd1, sector 0
>  > end_request: I/O error, dev nbd1, sector 8032264
>  > md: super_written gets error=-5, uptodate=0
>  > raid1: Disk failure on nbd1, disabling device.
>  >         Operation continuing on 1 devices
>  > Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
>  >  [<ffffffff88b1e125>] :nbd:sock_xmit+0x9d/0x301
>  >
>  > The fact that sock_xmit() in receive mode is unprotected seems to be
>  > the WHY a NULL pointer is possible; but I'm still trying to identify
>  > the HOW.
>  Do you know who is setting the socket NULL? Is it already NULL when you
>  get to this point? Is it the nbd-client -d? Is it the original
>  nbd-client/kernel that does it? Figuring that out would help narrow down
>  the cause.

I believe that NBD_CLEAR_SOCK from 'nbd-client -d' sets it to NULL.
lo->sock is already NULL on entry to sock_xmit().

So simply checking whether sock_xmit's local 'sock' is NULL _should_
avoid any possibility of a NULL pointer Oops: 'sock' is a local copy
(from the sock = lo->sock assignment), so it can't become NULL after
the check has passed. That is, unless I'm missing somewhere in the rest
of the kernel (not nbd) that would take action to set a socket to NULL?
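
To make that reasoning concrete, here is a minimal user-space sketch of
the snapshot-then-check pattern. It is not the kernel code: the
model_* types and functions are hypothetical stand-ins I made up for
illustration, and the clear is simulated inline rather than coming from
another process. It only models the NULL check; it says nothing about
whether the socket object itself is still alive afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the real kernel structures. */
struct model_socket { int id; };
struct model_nbd_device {
	const char *name;
	struct model_socket *sock;
};

/* Stand-in for the effect NBD_CLEAR_SOCK has on the device's socket field. */
static void model_clear_sock(struct model_nbd_device *lo)
{
	lo->sock = NULL;
}

/* Snapshot the shared pointer once, test the snapshot, use only the snapshot. */
static int model_sock_xmit(struct model_nbd_device *lo, int send)
{
	struct model_socket *sock = lo->sock;	/* single read of the shared field */

	if (!sock) {
		fprintf(stderr, "%s: attempted %s on closed socket\n",
			lo->name, send ? "send" : "recv");
		return -EINVAL;
	}

	/* Simulate a clear that arrives after the check has passed. */
	model_clear_sock(lo);

	/* The local copy is unaffected by the clear above. */
	return sock->id;
}

/* Small self-check: one call with a socket set, one after it was cleared. */
static void model_demo(void)
{
	struct model_socket s = { .id = 7 };
	struct model_nbd_device dev = { .name = "nbd1", .sock = &s };

	assert(model_sock_xmit(&dev, 0) == 7);		/* sock set on entry */
	assert(dev.sock == NULL);			/* cleared during the call */
	assert(model_sock_xmit(&dev, 0) == -EINVAL);	/* sock NULL on entry */
}
```

Calling model_demo() exercises both paths: the first call survives a
clear that lands after the check, and the second call takes the early
-EINVAL return because the field is already NULL on entry.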

The attached patch seems reasonable.  I'll be testing today to verify
it fixes the problem.

>  > But for me this begs the question:  why isn't the nbd_device's socket
>  > always protected during sock_xmit() for both
>  > transmits and receives; rather than just transmits (via tx_lock)!?
>  It would deadlock if we held the lock over both. Generally we don't have
>  to worry about receives, since they're always done in the nbd-client
>  process, so we have control over when and how it exits and cleans up.
>  The odd case, as you've discovered, is when another process (nbd-client
>  -d) comes along and starts mucking with the queue and socket. Would
>  "kill -9 <nbd-client-pid>" work for you instead? That is what I use to
>  break the connection, and it's safe, as it tells the original nbd-client
>  to exit (which it does cleanly and safely).

I'm aware tx_lock can't be held over both; I was suggesting maybe
another lock but that feels like overkill.

I use 'nbd-client -d' and then resort to 'kill -9' IFF 'nbd-client -d'
returned non-zero. But it sounds like simply using 'kill -9' could be a
near-term workaround; I'll try this as well and report back.

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b53fdb0..58f77b3 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -153,6 +153,12 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
 	struct kvec iov;
 	sigset_t blocked, oldset;
 
+	if (unlikely(!sock)) {
+		printk(KERN_ERR "%s: Attempted %s on closed socket in sock_xmit\n",
+		       lo->disk->disk_name, (send ? "send" : "recv"));
+		return -EINVAL;
+	}
+
 	/* Allow interception of SIGKILL only
 	 * Don't allow other signals to interrupt the transmission */
 	siginitsetinv(&blocked, sigmask(SIGKILL));
