
Re: [Nbd] nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while sock_xmit() is working on a receive

Mike Snitzer wrote:

In practice this looks like:

nbd1: Send control failed (result -32)
end_request: I/O error, dev nbd1, sector 0
end_request: I/O error, dev nbd1, sector 8032264
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on nbd1, disabling device.
        Operation continuing on 1 devices
Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
 [<ffffffff88b1e125>] :nbd:sock_xmit+0x9d/0x301

The fact that sock_xmit() in receive mode is unprotected seems to be
the WHY a NULL pointer is possible; but I'm still trying to identify
the HOW.

Do you know who is setting the socket to NULL? Is it already NULL when you get to this point? Is it the "nbd-client -d" that does it, or the original nbd-client/kernel? Figuring that out would help narrow down the cause.

But for me this raises the question: why isn't the nbd_device's socket
always protected during sock_xmit() for both transmits and receives,
rather than just transmits (via tx_lock)?

It would deadlock if we held the lock over both. Generally we don't have to worry about receives, since they're always done in the nbd-client process, so we have control over when and how it exits and cleans up. The odd case, as you've discovered, is when another process (nbd-client -d) comes along and starts mucking with the queue and socket.

Would "kill -9 <nbd-client-pid>" work for you instead? That is what I use to break the connection, and it's safe, as it tells the original nbd-client to exit (which it does cleanly and safely).
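To make the locking point concrete, here is a minimal user-space sketch, not the driver code. Presumably the deadlock arises because the receiver would sit in a blocking read while holding the lock the transmit path needs to send the very request it is waiting on. The sketch therefore takes a lock only long enough to snapshot the socket pointer, then drops it before any blocking work. Everything named here (fake_nbd, fake_sock, sock_lock, fake_clear_sock, fake_recv) is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a connected socket. */
struct fake_sock {
    int fd;
};

/* Hypothetical stand-in for the per-device state. */
struct fake_nbd {
    pthread_mutex_t sock_lock;  /* guards the pointer only, never held across I/O */
    struct fake_sock *sock;     /* NULL once the connection has been cleared */
};

/* Models a second process clearing the socket (the "nbd-client -d" case). */
static void fake_clear_sock(struct fake_nbd *dev)
{
    assert(dev != NULL);
    pthread_mutex_lock(&dev->sock_lock);
    dev->sock = NULL;
    pthread_mutex_unlock(&dev->sock_lock);
}

/*
 * Models the receive path: snapshot the pointer under the lock, release the
 * lock, and only then do the (potentially blocking) receive. Returns the
 * socket's fd on success or -EINVAL if the socket was already cleared.
 */
static int fake_recv(struct fake_nbd *dev)
{
    struct fake_sock *s;

    assert(dev != NULL);
    pthread_mutex_lock(&dev->sock_lock);
    s = dev->sock;
    pthread_mutex_unlock(&dev->sock_lock);

    if (s == NULL)
        return -EINVAL;  /* fail the request instead of dereferencing NULL */

    /* A real receiver would block here, with the lock already dropped. */
    return s->fd;
}
```

Note that this only turns the "already NULL" case into an error return. It does not stop the socket from being torn down after the check; a real fix would also need to hold a reference on the socket for the duration of the receive, or otherwise serialize the clear against an in-flight receive.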

