Re: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)

To: Hermann Lauer <Hermann.Lauer@...1489...>
Cc: nbd-general@lists.sourceforge.net
Subject: Re: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
From: Michal Belczyk <belczyk@...1274...>
Date: Fri, 14 Nov 2014 17:52:19 +0100
Message-id: <20141114165219.GD28731@...1273...>
In-reply-to: <20141114161117.GC15362@...1490...>
References: <20140210120607.GI19189@...1490...> <20140211222008.GE95782@...1273...> <20141114161117.GC15362@...1490...>

On Fri, Nov 14, 2014 at 05:11:17PM +0100, Hermann Lauer wrote:
> Dear nbd kernel maintainers,
> 
> On Tue, Feb 11, 2014 at 11:20:08PM +0100, Michal Belczyk wrote:
> > > according to a mail from 21 July 2011 (http://comments.gmane.org/gmane.linux.drivers.nbd.general/985)
> > > this is due to the ioctl(nbd, NBD_SET_TIMEOUT, timeout) not working.
> > > 
> > > Here on debian wheezy (Linux www1 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux) a disconnection of the server host yields hanging
> > > of any "cat /proc/mdstat" on the raid1 on top of the nbd device
> > > and to the appended oops (only two shown) of the kernel. Only
> > > bringing back the nbd server allows further commands on the md device.
> > > 
> > > Any news/ideas on that subject ?
> > 
> > No code has been pushed upstream to fix it since then...
> > 
> > You may want to take a look here:
> > 
> >   https://www.mail-archive.com/nbd-general@lists.sourceforge.net/msg01513.html
> >   https://www.mail-archive.com/nbd-general@lists.sourceforge.net/msg01549.html
> > 
> > I will post the patch against the most recent tip later this week.
> 
> reinvestigated this issue on debian jessie (kernel 3.16-3-amd64 #1 SMP Debian 3.16.5-1)
> and still getting hangs - has the mentioned patch made his way upstream ?
> 
> Looking at the 3.16.1 vanilla source at a first glance looks like it's not there - and
> I seem to have no version which applys cleanly.

IIRC this required some refactor for post 3.12 or 3.13 kernels...
Please try the attached patch.
This is the version rebased against the current tip, completely
untested.

-- 
Michal Belczyk Sr.

commit 64bbb436f8a22ca792c442cde41f7cf727f5fc8b
Author: Michal Belczyk <belczyk@...1274...>
Date:   Sun Jun 16 23:08:42 2013 +0100

    nbd: improve request timeouts handling
    
    The main idea behind it is to be able to quickly detect broken replica
    and switch over to another when used with any sort of mirror type device
    built on top of any number of nbd devices.
    
    Before this change a request would time out causing the socket to be
    shut down and the device to fail in case of a dead server or removed
    network cable only if:
    
      a) either the timer around kernel_sendmsg() kicked in
      b) or the TCP failures on retransmission finally caused an error
         on the socket, likely blocked on kernel_recvmsg() at this time,
         waiting for replies from the server
    
    Case a) depends mostly on the size of requests issued and on the maximum
    size of the socket buffer -- a lot of read request headers or small
    write requests could be "sent" without triggering the requested timeout
    
    Case b) timeout is independent of nbd-client -t <timeout> option
    as there is no TCP_USER_TIMEOUT set on the client socket by default.
    And even if such timeout was set it would not solve the problem of
    an nbd-client hung on receiving replies for much longer time without
    setting TCP keep-alives... and that would be the third, independent
    timeout setting required to make it work "almost" as expected...
    
    So, instead, take the big hammer approach and:
    
      *) trace the number of outstanding requests sent to the server
         (nbd->inflight)
      *) enable the timer (nbd->req_timer) before the first request
         is submitted and leave it enabled
      *) when sending next request do not touch the timer (it is up
         to the receiving side to control it at this point)
      *) on receive side update the timer every time a response
         is collected but there are more to read from the server
      *) disable the timer whenever the inflight counter drops
         to zero or an error (leading to the socket shutdown)
         is returned
    
    This patch does NOT prevent the server to process a request for longer
    than the timeout specified if only it replies to any other request
    submitted within the timeout (the server still may reply to a batch
    of requests in any order).
    
    Only the nbd->xmit_timeout != 0 code path is changed so the patch should
    not affect nbd connections running without an explicit timeout set
    on the nbd-client command line.
    
    There is also no way to enable or disable the timeout on an active
    (nbd->pid != 0) nbd device, it is however possible to change its value.
    Otherwise the inflight request counter would have to affect the nbd
    devices enabled without nbd-client -t <timeout>.
    
    Also move nbd->pid modifications behind nbd->tx_lock wherever possible
    to avoid races between the concurrent nbd-client invocations.

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4bc2a5c..a358a7a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
 
 static void nbd_xmit_timeout(unsigned long arg)
 {
-	struct task_struct *task = (struct task_struct *)arg;
+	struct nbd_device *nbd = (struct nbd_device *)arg;
+	struct task_struct *task_ary[2];
+	unsigned long flags;
+	int i;
 
-	printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
-		task->comm, task->pid);
-	force_sig(SIGKILL, task);
+	spin_lock_irqsave(&nbd->timer_lock, flags);
+	nbd->timedout = 1;
+	task_ary[0] = nbd->sender;
+	task_ary[1] = nbd->receiver;
+	for (i = 0; i < 2; i++) {
+		if (task_ary[i] == NULL)
+			continue;
+		printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
+			task_ary[i]->comm, task_ary[i]->pid);
+		force_sig(SIGKILL, task_ary[i]);
+		break;
+	}
+	spin_unlock_irqrestore(&nbd->timer_lock, flags);
 }
 
 /*
@@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 	struct msghdr msg;
 	struct kvec iov;
 	sigset_t blocked, oldset;
-	unsigned long pflags = current->flags;
+	unsigned long flags, pflags = current->flags;
 
 	if (unlikely(!sock)) {
 		dev_err(disk_to_dev(nbd->disk),
@@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 		msg.msg_controllen = 0;
 		msg.msg_flags = msg_flags | MSG_NOSIGNAL;
 
-		if (send) {
-			struct timer_list ti;
-
-			if (nbd->xmit_timeout) {
-				init_timer(&ti);
-				ti.function = nbd_xmit_timeout;
-				ti.data = (unsigned long)current;
-				ti.expires = jiffies + nbd->xmit_timeout;
-				add_timer(&ti);
+		if (nbd->xmit_timeout) {
+			spin_lock_irqsave(&nbd->timer_lock, flags);
+			if (nbd->timedout) {
+				spin_unlock_irqrestore(&nbd->timer_lock, flags);
+				printk(KERN_WARNING
+					"nbd (pid %d: %s) timed out\n",
+					task_pid_nr(current), current->comm);
+				result = -EINTR;
+				sock_shutdown(nbd, !send);
+				break;
 			}
+			if (send)
+				nbd->sender = current;
+			else
+				nbd->receiver = current;
+			spin_unlock_irqrestore(&nbd->timer_lock, flags);
+		}
+
+		if (send)
 			result = kernel_sendmsg(sock, &msg, &iov, 1, size);
-			if (nbd->xmit_timeout)
-				del_timer_sync(&ti);
-		} else
+		else
 			result = kernel_recvmsg(sock, &msg, &iov, 1, size,
 						msg.msg_flags);
 
+		if (nbd->xmit_timeout) {
+			spin_lock_irqsave(&nbd->timer_lock, flags);
+			if (send)
+				nbd->sender = NULL;
+			else
+				nbd->receiver = NULL;
+			spin_unlock_irqrestore(&nbd->timer_lock, flags);
+		}
+
 		if (signal_pending(current)) {
 			siginfo_t info;
 			printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
@@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 }
 
 static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
-		int flags)
+		int msg_flags)
 {
 	int result;
 	void *kaddr = kmap(bvec->bv_page);
 	result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
-			   bvec->bv_len, flags);
+			   bvec->bv_len, msg_flags);
 	kunmap(bvec->bv_page);
 	return result;
 }
@@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
 /* always call with the tx_lock held */
 static int nbd_send_req(struct nbd_device *nbd, struct request *req)
 {
-	int result, flags;
+	int result, msg_flags;
 	struct nbd_request request;
-	unsigned long size = blk_rq_bytes(req);
+	unsigned long flags, size = blk_rq_bytes(req);
 
 	memset(&request, 0, sizeof(request));
 	request.magic = htonl(NBD_REQUEST_MAGIC);
@@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
 	}
 	memcpy(request.handle, &req, sizeof(req));
 
+	if (nbd->xmit_timeout) {
+		spin_lock_irqsave(&nbd->timer_lock, flags);
+		if (!nbd->inflight) {
+			nbd->req_timer.function = nbd_xmit_timeout;
+			nbd->req_timer.data = (unsigned long)nbd;
+			nbd->req_timer.expires = jiffies + nbd->xmit_timeout;
+			add_timer(&nbd->req_timer);
+		}
+		nbd->inflight++;
+		BUG_ON(nbd->inflight <= 0);
+		spin_unlock_irqrestore(&nbd->timer_lock, flags);
+	}
+
 	dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n",
 			nbd->disk->disk_name, req,
 			nbdcmd_to_ascii(nbd_cmd(req)),
@@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
 		 * whether to set MSG_MORE or not...
 		 */
 		rq_for_each_segment(bvec, req, iter) {
-			flags = 0;
+			msg_flags = 0;
 			if (!rq_iter_last(bvec, iter))
-				flags = MSG_MORE;
+				msg_flags = MSG_MORE;
 			dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n",
-					nbd->disk->disk_name, req, bvec.bv_len);
-			result = sock_send_bvec(nbd, &bvec, flags);
+					nbd->disk->disk_name, req, bvec->bv_len);
+			result = sock_send_bvec(nbd, &bvec, msg_flags);
 			if (result <= 0) {
 				dev_err(disk_to_dev(nbd->disk),
 					"Send data failed (result %d)\n",
@@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
 	return 0;
 
 error_out:
+	if (nbd->xmit_timeout) {
+		spin_lock_irqsave(&nbd->timer_lock, flags);
+		nbd->inflight--;
+		BUG_ON(nbd->inflight < 0);
+		if (!nbd->inflight)
+			del_timer_sync(&nbd->req_timer);
+		spin_unlock_irqrestore(&nbd->timer_lock, flags);
+	}
 	return -EIO;
 }
 
@@ -412,24 +462,41 @@ static struct device_attribute pid_attr = {
 static int nbd_do_it(struct nbd_device *nbd)
 {
 	struct request *req;
+	unsigned long flags;
 	int ret;
 
 	BUG_ON(nbd->magic != NBD_MAGIC);
 
 	sk_set_memalloc(nbd->sock->sk);
-	nbd->pid = task_pid_nr(current);
 	ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (ret) {
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
-		nbd->pid = 0;
 		return ret;
 	}
 
-	while ((req = nbd_read_stat(nbd)) != NULL)
-		nbd_end_request(req);
+	for (;;) {
+		req = nbd_read_stat(nbd);
+		if (nbd->xmit_timeout) {
+			spin_lock_irqsave(&nbd->timer_lock, flags);
+			if (req != NULL) {
+				nbd->inflight--;
+				BUG_ON(nbd->inflight < 0);
+			}
+			if (req != NULL && nbd->inflight)
+				mod_timer(&nbd->req_timer,
+					  jiffies + nbd->xmit_timeout);
+			else
+				del_timer_sync(&nbd->req_timer);
+			spin_unlock_irqrestore(&nbd->timer_lock, flags);
+		}
+		if (req != NULL) {
+			nbd_end_request(req);
+			continue;
+		}
+		break;
+	}
 
 	device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
-	nbd->pid = 0;
 	return 0;
 }
 
@@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		set_capacity(nbd->disk, nbd->bytesize >> 9);
 		return 0;
 
-	case NBD_SET_TIMEOUT:
-		nbd->xmit_timeout = arg * HZ;
+	case NBD_SET_TIMEOUT: {
+		int xt;
+
+		xt = arg * HZ;
+		if (xt < 0)
+			return -EINVAL;
+		if (nbd->pid &&
+		    ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt)))
+			return -EBUSY;
+		dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n",
+			nbd->xmit_timeout / HZ, xt / HZ);
+		nbd->xmit_timeout = xt;
 		return 0;
+	}
 
 	case NBD_SET_FLAGS:
 		nbd->flags = arg;
@@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		if (!nbd->sock)
 			return -EINVAL;
 
+		nbd->pid = task_pid_nr(current);
+		nbd->inflight = 0;
+		nbd->timedout = 0;
+		nbd->sender = NULL;
+		nbd->receiver = NULL;
 		mutex_unlock(&nbd->tx_lock);
 
 		if (nbd->flags & NBD_FLAG_READ_ONLY)
@@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 					nbd->disk->disk_name);
 		if (IS_ERR(thread)) {
 			mutex_lock(&nbd->tx_lock);
+			nbd->pid = 0;
 			return PTR_ERR(thread);
 		}
 		wake_up_process(thread);
@@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		kthread_stop(thread);
 
 		mutex_lock(&nbd->tx_lock);
+		nbd->pid = 0;
 		if (error)
 			return error;
 		sock_shutdown(nbd, 0);
@@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 			sockfd_put(sock);
 		nbd->flags = 0;
 		nbd->bytesize = 0;
+		nbd->xmit_timeout = 0;
 		bdev->bd_inode->i_size = 0;
 		set_capacity(nbd->disk, 0);
 		if (max_part > 0)
@@ -874,6 +960,8 @@ static int __init nbd_init(void)
 		init_waitqueue_head(&nbd_dev[i].waiting_wq);
 		nbd_dev[i].blksize = 1024;
 		nbd_dev[i].bytesize = 0;
+		spin_lock_init(&nbd_dev[i].timer_lock);
+		init_timer(&nbd_dev[i].req_timer);
 		disk->major = NBD_MAJOR;
 		disk->first_minor = i << part_shift;
 		disk->fops = &nbd_fops;
diff --git a/include/linux/nbd.h b/include/linux/nbd.h
index f62f78a..c1280ca 100644
--- a/include/linux/nbd.h
+++ b/include/linux/nbd.h
@@ -41,6 +41,12 @@ struct nbd_device {
 	pid_t pid; /* pid of nbd-client, if attached */
 	int xmit_timeout;
 	int disconnect; /* a disconnect has been requested by user */
+	spinlock_t timer_lock;
+	struct timer_list req_timer;
+	int inflight;
+	int timedout;
+	struct task_struct *sender;
+	struct task_struct *receiver;
 };
 
 #endif

Reply to:

Follow-Ups:
- Re: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
  - From: Hermann Lauer <Hermann.Lauer@...1489...>

References:
- [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
  - From: Hermann Lauer <Hermann.Lauer@...1489...>

Prev by Date: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
Next by Date: Re: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
Previous by thread: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
Next by thread: Re: [Nbd] nbd timeout on debian jessie (Was: Re: oops with timeout option on debian wheezy kernel)
Index(es):
- Date
- Thread