Re: [Nbd] [PATCH] nbd: improve request timeouts handling

To: Michal Belczyk <belczyk@...1274...>
Cc: nbd-general@lists.sourceforge.net
Subject: Re: [Nbd] [PATCH] nbd: improve request timeouts handling
From: Markus Pargmann <mpa@...1897...>
Date: Sun, 1 Mar 2015 21:39:19 +0100
Message-id: <20150301203919.GA2121@...1897...>
In-reply-to: <20150226163712.GA84021@...1273...>
References: <20150226163712.GA84021@...1273...>
Hi Michal,

First of all, thanks for resending this. It works for my test setup but
I have some comments inline below.

On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote:
> The main idea behind it is to be able to quickly detect broken replica
> and switch over to another when used with any sort of mirror type device
> built on top of any number of nbd devices.
> 
> Before this change a request would time out causing the socket to be
> shut down and the device to fail in case of a dead server or removed
> network cable only if:
> 
>   a) either the timer around kernel_sendmsg() kicked in
>   b) or the TCP failures on retransmission finally caused an error
>      on the socket, likely blocked on kernel_recvmsg() at this time,
>      waiting for replies from the server
> 
> Case a) depends mostly on the size of requests issued and on the maximum
> size of the socket buffer -- a lot of read request headers or small
> write requests could be "sent" without triggering the requested timeout
> 
> Case b) timeout is independent of nbd-client -t <timeout> option
> as there is no TCP_USER_TIMEOUT set on the client socket by default.
> And even if such timeout was set it would not solve the problem of
> an nbd-client hung on receiving replies for much longer time without
> setting TCP keep-alives... and that would be the third, independent
> timeout setting required to make it work "almost" as expected...
> 
> So, instead, take the big hammer approach and:
> 
>   *) trace the number of outstanding requests sent to the server
>      (nbd->inflight)
>   *) enable the timer (nbd->req_timer) before the first request
>      is submitted and leave it enabled
>   *) when sending next request do not touch the timer (it is up
>      to the receiving side to control it at this point)
>   *) on receive side update the timer every time a response
>      is collected but there are more to read from the server
>   *) disable the timer whenever the inflight counter drops
>      to zero or an error (leading to the socket shutdown)
>      is returned
> 
> This patch does NOT prevent the server to process a request for longer
> than the timeout specified if only it replies to any other request
> submitted within the timeout (the server still may reply to a batch
> of requests in any order).
> 
> Only the nbd->xmit_timeout != 0 code path is changed so the patch should
> not affect nbd connections running without an explicit timeout set
> on the nbd-client command line.
> 
> There is also no way to enable or disable the timeout on an active
> (nbd->pid != 0) nbd device, it is however possible to change its value.
> Otherwise the inflight request counter would have to affect the nbd
> devices enabled without nbd-client -t <timeout>.
> 
> Also move nbd->pid modifications behind nbd->tx_lock wherever possible
> to avoid races between the concurrent nbd-client invocations.

This commit message is really good.

> 
> Signed-off-by: Michal Belczyk <belczyk@...1274...>
> ---
>  drivers/block/nbd.c | 150 +++++++++++++++++++++++++++++++++++++++++-----------
>  include/linux/nbd.h |   6 +++
>  2 files changed, 125 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 4bc2a5c..cc4a98a 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
>  
>  static void nbd_xmit_timeout(unsigned long arg)
>  {
> -	struct task_struct *task = (struct task_struct *)arg;
> +	struct nbd_device *nbd = (struct nbd_device *)arg;
> +	struct task_struct *task_ary[2];
> +	unsigned long flags;
> +	int i;
>  
> -	printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> -		task->comm, task->pid);
> -	force_sig(SIGKILL, task);
> +	spin_lock_irqsave(&nbd->timer_lock, flags);
> +	nbd->timedout = 1;
> +	task_ary[0] = nbd->sender;
> +	task_ary[1] = nbd->receiver;
> +	for (i = 0; i < 2; i++) {
> +		if (task_ary[i] == NULL)
> +			continue;
> +		printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> +			task_ary[i]->comm, task_ary[i]->pid);
> +		force_sig(SIGKILL, task_ary[i]);
> +		break;
> +	}
> +	spin_unlock_irqrestore(&nbd->timer_lock, flags);
>  }
>  
>  /*
> @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
>  	struct msghdr msg;
>  	struct kvec iov;
>  	sigset_t blocked, oldset;
> -	unsigned long pflags = current->flags;
> +	unsigned long flags, pflags = current->flags;

I would prefer variable declerations in seperate lines, especially if
there is an assignment involved, for readability.

>  
>  	if (unlikely(!sock)) {
>  		dev_err(disk_to_dev(nbd->disk),
> @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
>  		msg.msg_controllen = 0;
>  		msg.msg_flags = msg_flags | MSG_NOSIGNAL;
>  
> -		if (send) {
> -			struct timer_list ti;
> -
> -			if (nbd->xmit_timeout) {
> -				init_timer(&ti);
> -				ti.function = nbd_xmit_timeout;
> -				ti.data = (unsigned long)current;
> -				ti.expires = jiffies + nbd->xmit_timeout;
> -				add_timer(&ti);
> +		if (nbd->xmit_timeout) {
> +			spin_lock_irqsave(&nbd->timer_lock, flags);
> +			if (nbd->timedout) {
> +				spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +				printk(KERN_WARNING
> +					"nbd (pid %d: %s) timed out\n",
> +					task_pid_nr(current), current->comm);
> +				result = -EINTR;
> +				sock_shutdown(nbd, !send);
> +				break;

I am not a big fan of having this block with a sock_shutdown at two
points in sock_xmit, here as reaction on nbd->timedout and below as
handler of the signal received.

>  			}
> +			if (send)
> +				nbd->sender = current;
> +			else
> +				nbd->receiver = current;
> +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +		}
> +

What happens if the timer interrupt is happens right at this point?
The other process may just fall out of the kernel_send/recvmsg() call
through the signal and call into sock_shutdown(). There it would wait
for the tx_lock (assuming it is receiving). At the same time this
process would continue into kernel_sendmsg. There is no active timer
anymore and the other process is waiting in the lock.
Is there something I am missing or may this be a problem?

> +		if (send)
>  			result = kernel_sendmsg(sock, &msg, &iov, 1, size);
> -			if (nbd->xmit_timeout)
> -				del_timer_sync(&ti);
> -		} else
> +		else
>  			result = kernel_recvmsg(sock, &msg, &iov, 1, size,
>  						msg.msg_flags);
>  
> +		if (nbd->xmit_timeout) {
> +			spin_lock_irqsave(&nbd->timer_lock, flags);
> +			if (send)
> +				nbd->sender = NULL;
> +			else
> +				nbd->receiver = NULL;
> +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +		}
> +
>  		if (signal_pending(current)) {
>  			siginfo_t info;
>  			printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
> @@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
>  }
>  
>  static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
> -		int flags)
> +		int msg_flags)
>  {
>  	int result;
>  	void *kaddr = kmap(bvec->bv_page);
>  	result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
> -			   bvec->bv_len, flags);
> +			   bvec->bv_len, msg_flags);

This renaming from flags to msg_flags seems unrelated. I also don't see
why this is necessary. Please split this into a separate patch if it is
necessary.

>  	kunmap(bvec->bv_page);
>  	return result;
>  }
> @@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
>  /* always call with the tx_lock held */
>  static int nbd_send_req(struct nbd_device *nbd, struct request *req)
>  {
> -	int result, flags;
> +	int result, msg_flags;

Same as above, please rename msg_flags in a separate patch.
Perhaps 'irq_flags' for the new 'flags' variable would be better?

>  	struct nbd_request request;
> -	unsigned long size = blk_rq_bytes(req);
> +	unsigned long flags, size = blk_rq_bytes(req);
>  
>  	memset(&request, 0, sizeof(request));
>  	request.magic = htonl(NBD_REQUEST_MAGIC);
> @@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
>  	}
>  	memcpy(request.handle, &req, sizeof(req));
>  
> +	if (nbd->xmit_timeout) {
> +		spin_lock_irqsave(&nbd->timer_lock, flags);
> +		if (!nbd->inflight) {
> +			nbd->req_timer.function = nbd_xmit_timeout;
> +			nbd->req_timer.data = (unsigned long)nbd;

Isn't it possible to assign the above two at initialization?

> +			nbd->req_timer.expires = jiffies + nbd->xmit_timeout;
> +			add_timer(&nbd->req_timer);
> +		}
> +		nbd->inflight++;
> +		BUG_ON(nbd->inflight <= 0);
> +		spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +	}
> +
>  	dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n",
>  			nbd->disk->disk_name, req,
>  			nbdcmd_to_ascii(nbd_cmd(req)),
> @@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
>  		 * whether to set MSG_MORE or not...
>  		 */
>  		rq_for_each_segment(bvec, req, iter) {
> -			flags = 0;
> +			msg_flags = 0;
>  			if (!rq_iter_last(bvec, iter))
> -				flags = MSG_MORE;
> +				msg_flags = MSG_MORE;
>  			dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n",
>  					nbd->disk->disk_name, req, bvec.bv_len);
> -			result = sock_send_bvec(nbd, &bvec, flags);
> +			result = sock_send_bvec(nbd, &bvec, msg_flags);
>  			if (result <= 0) {
>  				dev_err(disk_to_dev(nbd->disk),
>  					"Send data failed (result %d)\n",
> @@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
>  	return 0;
>  
>  error_out:
> +	if (nbd->xmit_timeout) {
> +		spin_lock_irqsave(&nbd->timer_lock, flags);
> +		nbd->inflight--;
> +		BUG_ON(nbd->inflight < 0);

nbd->inflight was checked above already with BUG_ON().

> +		if (!nbd->inflight)
> +			del_timer_sync(&nbd->req_timer);
> +		spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +	}
>  	return -EIO;
>  }
>  
> @@ -412,24 +462,41 @@ static struct device_attribute pid_attr = {
>  static int nbd_do_it(struct nbd_device *nbd)
>  {
>  	struct request *req;
> +	unsigned long flags;
>  	int ret;
>  
>  	BUG_ON(nbd->magic != NBD_MAGIC);
>  
>  	sk_set_memalloc(nbd->sock->sk);
> -	nbd->pid = task_pid_nr(current);
>  	ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
>  	if (ret) {
>  		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
> -		nbd->pid = 0;
>  		return ret;
>  	}
>  
> -	while ((req = nbd_read_stat(nbd)) != NULL)
> -		nbd_end_request(req);
> +	for (;;) {
> +		req = nbd_read_stat(nbd);
> +		if (nbd->xmit_timeout) {
> +			spin_lock_irqsave(&nbd->timer_lock, flags);
> +			if (req != NULL) {
> +				nbd->inflight--;
> +				BUG_ON(nbd->inflight < 0);
> +			}
> +			if (req != NULL && nbd->inflight)
> +				mod_timer(&nbd->req_timer,
> +					  jiffies + nbd->xmit_timeout);
> +			else
> +				del_timer_sync(&nbd->req_timer);
> +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +		}
> +		if (req != NULL) {
> +			nbd_end_request(req);
> +			continue;
> +		}
> +		break;
> +	}

Please make this loop a bit more readable. Perhaps something like this?

	while ((req = nbd_read_stat(nbd)) != NULL) {
		nbd_end_request(req);
		if (nbd->xmit_timeout) {
			...
		}
	}
	if (nbd->xmit_timeout) {
		lock();
		del_timer_sync();
		unlock();
	}

>  
>  	device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
> -	nbd->pid = 0;

Why is this removed?

>  	return 0;
>  }
>  
> @@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  		set_capacity(nbd->disk, nbd->bytesize >> 9);
>  		return 0;
>  
> -	case NBD_SET_TIMEOUT:
> -		nbd->xmit_timeout = arg * HZ;
> +	case NBD_SET_TIMEOUT: {
> +		int xt;
> +
> +		xt = arg * HZ;
> +		if (xt < 0)
> +			return -EINVAL;
> +		if (nbd->pid &&

Could you use nbd->sock here please? This doesn't really depend on a pid
does it?

> +		    ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt)))
> +			return -EBUSY;
> +		dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n",
> +			nbd->xmit_timeout / HZ, xt / HZ);

Please use dev_dbg here.

> +		nbd->xmit_timeout = xt;
>  		return 0;
> +	}
>  
>  	case NBD_SET_FLAGS:
>  		nbd->flags = arg;
> @@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  		if (!nbd->sock)
>  			return -EINVAL;
>  
> +		nbd->pid = task_pid_nr(current);
> +		nbd->inflight = 0;
> +		nbd->timedout = 0;
> +		nbd->sender = NULL;
> +		nbd->receiver = NULL;
>  		mutex_unlock(&nbd->tx_lock);
>  
>  		if (nbd->flags & NBD_FLAG_READ_ONLY)
> @@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  					nbd->disk->disk_name);
>  		if (IS_ERR(thread)) {
>  			mutex_lock(&nbd->tx_lock);
> +			nbd->pid = 0;

I can't really see why it is necessary to get nbd->pid locked? And if
you need this, please split it into a separate patch.

>  			return PTR_ERR(thread);
>  		}
>  		wake_up_process(thread);
> @@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  		kthread_stop(thread);
>  
>  		mutex_lock(&nbd->tx_lock);
> +		nbd->pid = 0;
>  		if (error)
>  			return error;
>  		sock_shutdown(nbd, 0);
> @@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
>  			sockfd_put(sock);
>  		nbd->flags = 0;
>  		nbd->bytesize = 0;
> +		nbd->xmit_timeout = 0;
>  		bdev->bd_inode->i_size = 0;
>  		set_capacity(nbd->disk, 0);
>  		if (max_part > 0)
> @@ -874,6 +960,8 @@ static int __init nbd_init(void)
>  		init_waitqueue_head(&nbd_dev[i].waiting_wq);
>  		nbd_dev[i].blksize = 1024;
>  		nbd_dev[i].bytesize = 0;
> +		spin_lock_init(&nbd_dev[i].timer_lock);
> +		init_timer(&nbd_dev[i].req_timer);
>  		disk->major = NBD_MAJOR;
>  		disk->first_minor = i << part_shift;
>  		disk->fops = &nbd_fops;
> diff --git a/include/linux/nbd.h b/include/linux/nbd.h
> index f62f78a..c1280ca 100644
> --- a/include/linux/nbd.h
> +++ b/include/linux/nbd.h
> @@ -41,6 +41,12 @@ struct nbd_device {
>  	pid_t pid; /* pid of nbd-client, if attached */
>  	int xmit_timeout;
>  	int disconnect; /* a disconnect has been requested by user */
> +	spinlock_t timer_lock;
> +	struct timer_list req_timer;
> +	int inflight;
> +	int timedout;

Do we really need 'timedout'? or is it possible to use the pending
signal to immediately shutdown the nbd?

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Attachment: signature.asc
Description: Digital signature
Reply to:
Follow-Ups:
- Re: [Nbd] [PATCH] nbd: improve request timeouts handling
  - From: Michal Belczyk <belczyk@...1274...>
Next by Date: [Nbd] nbd 3.9 released
Next by thread: Re: [Nbd] [PATCH] nbd: improve request timeouts handling
Index(es):
- Date
- Thread