Re: [Nbd] [PATCH] nbd: improve request timeouts handling

To: Michal Belczyk <belczyk@...1274...>
Cc: nbd-general@lists.sourceforge.net
Subject: Re: [Nbd] [PATCH] nbd: improve request timeouts handling
From: Markus Pargmann <mpa@...1897...>
Date: Thu, 5 Mar 2015 08:17:58 +0100
Message-id: <20150305071758.GB8062@...1897...>
In-reply-to: <20150303222639.GB8313@...1273...>
References: <20150226163712.GA84021@...1273...> <20150301203919.GA2121@...1897...> <20150303222639.GB8313@...1273...>
Hi Michal,

On Tue, Mar 03, 2015 at 11:26:39PM +0100, Michal Belczyk wrote:
> Hi Markus,
> 
> sorry for the late reply, please find my comments inline...

not late at all.

> 
> On Sun, Mar 01, 2015 at 09:39:19PM +0100, Markus Pargmann wrote:
> > Hi Michal,
> > 
> > First of all, thanks for resending this. It works for my test setup but
> > I have some comments inline below.
> > 
> > On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote:
> > > Signed-off-by: Michal Belczyk <belczyk@...1274...>
> > > ---
> > >  drivers/block/nbd.c | 150 +++++++++++++++++++++++++++++++++++++++++-----------
> > >  include/linux/nbd.h |   6 +++
> > >  2 files changed, 125 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > > index 4bc2a5c..cc4a98a 100644
> > > --- a/drivers/block/nbd.c
> > > +++ b/drivers/block/nbd.c
> > > @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
> > >  
> > >  static void nbd_xmit_timeout(unsigned long arg)
> > >  {
> > > -	struct task_struct *task = (struct task_struct *)arg;
> > > +	struct nbd_device *nbd = (struct nbd_device *)arg;
> > > +	struct task_struct *task_ary[2];
> > > +	unsigned long flags;
> > > +	int i;
> > >  
> > > -	printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> > > -		task->comm, task->pid);
> > > -	force_sig(SIGKILL, task);
> > > +	spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +	nbd->timedout = 1;
> > > +	task_ary[0] = nbd->sender;
> > > +	task_ary[1] = nbd->receiver;
> > > +	for (i = 0; i < 2; i++) {
> > > +		if (task_ary[i] == NULL)
> > > +			continue;
> > > +		printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> > > +			task_ary[i]->comm, task_ary[i]->pid);
> > > +		force_sig(SIGKILL, task_ary[i]);
> > > +		break;
> > > +	}
> > > +	spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > >  }
> > >  
> > >  /*
> > > @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
> > >  	struct msghdr msg;
> > >  	struct kvec iov;
> > >  	sigset_t blocked, oldset;
> > > -	unsigned long pflags = current->flags;
> > > +	unsigned long flags, pflags = current->flags;
> > 
> > I would prefer variable declerations in seperate lines, especially if
> > there is an assignment involved, for readability.
> 
> Feel free to change it to whatever style you prefer.
> 
> 
> > >  	if (unlikely(!sock)) {
> > >  		dev_err(disk_to_dev(nbd->disk),
> > > @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
> > >  		msg.msg_controllen = 0;
> > >  		msg.msg_flags = msg_flags | MSG_NOSIGNAL;
> > >  
> > > -		if (send) {
> > > -			struct timer_list ti;
> > > -
> > > -			if (nbd->xmit_timeout) {
> > > -				init_timer(&ti);
> > > -				ti.function = nbd_xmit_timeout;
> > > -				ti.data = (unsigned long)current;
> > > -				ti.expires = jiffies + nbd->xmit_timeout;
> > > -				add_timer(&ti);
> > > +		if (nbd->xmit_timeout) {
> > > +			spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +			if (nbd->timedout) {
> > > +				spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +				printk(KERN_WARNING
> > > +					"nbd (pid %d: %s) timed out\n",
> > > +					task_pid_nr(current), current->comm);
> > > +				result = -EINTR;
> > > +				sock_shutdown(nbd, !send);
> > > +				break;
> > 
> > I am not a big fan of having this block with a sock_shutdown at two
> > points in sock_xmit, here as reaction on nbd->timedout and below as
> > handler of the signal received.
> 
> Me neither, feel free to change it to whatever you prefer as long as it
> is functionally the same thing.
> The whole patch was not meant to be pushed as-is but rather as an
> initial sketch of code to discuss, so far ignored for over a year...

Yes, sorry for that.

> 
> 
> > >  			}
> > > +			if (send)
> > > +				nbd->sender = current;
> > > +			else
> > > +				nbd->receiver = current;
> > > +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +		}
> > > +
> > 
> > What happens if the timer interrupt is happens right at this point?
> > The other process may just fall out of the kernel_send/recvmsg() call
> > through the signal and call into sock_shutdown(). There it would wait
> > for the tx_lock (assuming it is receiving). At the same time this
> > process would continue into kernel_sendmsg. There is no active timer
> > anymore and the other process is waiting in the lock.
> > Is there something I am missing or may this be a problem?
> 
> That bit of the code path is somewhat tricky.
> Most writes to the socket will be non-blocking as long as the sockbuf
> permits.  If I understand correctly you are most concerned of the path
> where the SIGKILL(timeout) to the sender thread is queued prior to entering
> kernel_sendmsg() which would block it indefinitely and also cause the
> receiver process to wait for the tx_lock in its shutdown path, right?
> When that happens, the semantics should be similar to a regular write()
> on the blocking socket -- return -1 and set errno to EINTR unless
> sigaction(SA_RESTART) foo was done, etc.
> This actually works almost the same way in our case -- the pending
> signal will cause kernel_sendmsg() to eventually return -ERESTARTSYS.
> You may actually verify this theory by adding:
> 
>     if (send) {
>         force_sig(SIGKILL, current);
> 	result = kernel_sendmsg(sock, &msg, &iov, 1, size);
> 	printk(KERN_WARNING "result=%d\n", result);
> 	if (result >= 0)
> 	    flush_signals(current);
>     }
> 
> Then if you reduce the default sockbuf size (net.ipv4.tcp_wmem) to, say,
> 4KB for min/initial/max, a simple dd if=/dev/zero of=/dev/nbdX bs=512K
> count=1 oflag=direct should do the trick -- dmesg will reveal what
> happened...
> 
> Please correct me if I am wrong.

Thanks for the explanation, I forgot that there is a signal pending for
the second process as well. I will test this to be sure but it should work
then.

> 
> 
> > > +		if (send)
> > >  			result = kernel_sendmsg(sock, &msg, &iov, 1, size);
> > > -			if (nbd->xmit_timeout)
> > > -				del_timer_sync(&ti);
> > > -		} else
> > > +		else
> > >  			result = kernel_recvmsg(sock, &msg, &iov, 1, size,
> > >  						msg.msg_flags);
> > >  
> > > +		if (nbd->xmit_timeout) {
> > > +			spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +			if (send)
> > > +				nbd->sender = NULL;
> > > +			else
> > > +				nbd->receiver = NULL;
> > > +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +		}
> > > +
> > >  		if (signal_pending(current)) {
> > >  			siginfo_t info;
> > >  			printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
> > > @@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
> > >  }
> > >  
> > >  static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
> > > -		int flags)
> > > +		int msg_flags)
> > >  {
> > >  	int result;
> > >  	void *kaddr = kmap(bvec->bv_page);
> > >  	result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
> > > -			   bvec->bv_len, flags);
> > > +			   bvec->bv_len, msg_flags);
> > 
> > This renaming from flags to msg_flags seems unrelated. I also don't see
> > why this is necessary. Please split this into a separate patch if it is
> > necessary.
> 
> This is unrelated, you may well remove it from the patch.
> 
> 
> > >  	kunmap(bvec->bv_page);
> > >  	return result;
> > >  }
> > > @@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
> > >  /* always call with the tx_lock held */
> > >  static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> > >  {
> > > -	int result, flags;
> > > +	int result, msg_flags;
> > 
> > Same as above, please rename msg_flags in a separate patch.
> > Perhaps 'irq_flags' for the new 'flags' variable would be better?
> 
> Whatever you prefer.
> 
> 
> > >  	struct nbd_request request;
> > > -	unsigned long size = blk_rq_bytes(req);
> > > +	unsigned long flags, size = blk_rq_bytes(req);
> > >  
> > >  	memset(&request, 0, sizeof(request));
> > >  	request.magic = htonl(NBD_REQUEST_MAGIC);
> > > @@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> > >  	}
> > >  	memcpy(request.handle, &req, sizeof(req));
> > >  
> > > +	if (nbd->xmit_timeout) {
> > > +		spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +		if (!nbd->inflight) {
> > > +			nbd->req_timer.function = nbd_xmit_timeout;
> > > +			nbd->req_timer.data = (unsigned long)nbd;
> > 
> > Isn't it possible to assign the above two at initialization?
> 
> I think the first two should be ok to move to the init code.
> 
> 
> > > +			nbd->req_timer.expires = jiffies + nbd->xmit_timeout;
> > > +			add_timer(&nbd->req_timer);
> > > +		}
> > > +		nbd->inflight++;
> > > +		BUG_ON(nbd->inflight <= 0);
> > > +		spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +	}
> > > +
> > >  	dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n",
> > >  			nbd->disk->disk_name, req,
> > >  			nbdcmd_to_ascii(nbd_cmd(req)),
> > > @@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> > >  		 * whether to set MSG_MORE or not...
> > >  		 */
> > >  		rq_for_each_segment(bvec, req, iter) {
> > > -			flags = 0;
> > > +			msg_flags = 0;
> > >  			if (!rq_iter_last(bvec, iter))
> > > -				flags = MSG_MORE;
> > > +				msg_flags = MSG_MORE;
> > >  			dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n",
> > >  					nbd->disk->disk_name, req, bvec.bv_len);
> > > -			result = sock_send_bvec(nbd, &bvec, flags);
> > > +			result = sock_send_bvec(nbd, &bvec, msg_flags);
> > >  			if (result <= 0) {
> > >  				dev_err(disk_to_dev(nbd->disk),
> > >  					"Send data failed (result %d)\n",
> > > @@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> > >  	return 0;
> > >  
> > >  error_out:
> > > +	if (nbd->xmit_timeout) {
> > > +		spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +		nbd->inflight--;
> > > +		BUG_ON(nbd->inflight < 0);
> > 
> > nbd->inflight was checked above already with BUG_ON().
> 
> The BUG_ON()'s may well be removed -- this is a leftover from my initial
> testing of this patch to make sure the counters do not overflow or go
> negative -- do whatever you want with it (I'd leave them as they are,
> but it's up to you).
> 
> 
> > > +		if (!nbd->inflight)
> > > +			del_timer_sync(&nbd->req_timer);
> > > +		spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +	}
> > >  	return -EIO;
> > >  }
> > >  
> > > @@ -412,24 +462,41 @@ static struct device_attribute pid_attr = {
> > >  static int nbd_do_it(struct nbd_device *nbd)
> > >  {
> > >  	struct request *req;
> > > +	unsigned long flags;
> > >  	int ret;
> > >  
> > >  	BUG_ON(nbd->magic != NBD_MAGIC);
> > >  
> > >  	sk_set_memalloc(nbd->sock->sk);
> > > -	nbd->pid = task_pid_nr(current);
> > >  	ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
> > >  	if (ret) {
> > >  		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
> > > -		nbd->pid = 0;
> > >  		return ret;
> > >  	}
> > >  
> > > -	while ((req = nbd_read_stat(nbd)) != NULL)
> > > -		nbd_end_request(req);
> > > +	for (;;) {
> > > +		req = nbd_read_stat(nbd);
> > > +		if (nbd->xmit_timeout) {
> > > +			spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +			if (req != NULL) {
> > > +				nbd->inflight--;
> > > +				BUG_ON(nbd->inflight < 0);
> > > +			}
> > > +			if (req != NULL && nbd->inflight)
> > > +				mod_timer(&nbd->req_timer,
> > > +					  jiffies + nbd->xmit_timeout);
> > > +			else
> > > +				del_timer_sync(&nbd->req_timer);
> > > +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +		}
> > > +		if (req != NULL) {
> > > +			nbd_end_request(req);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > 
> > Please make this loop a bit more readable. Perhaps something like this?
> > 
> > 	while ((req = nbd_read_stat(nbd)) != NULL) {
> > 		nbd_end_request(req);
> > 		if (nbd->xmit_timeout) {
> > 			...
> > 		}
> > 	}
> > 	if (nbd->xmit_timeout) {
> > 		lock();
> > 		del_timer_sync();
> > 		unlock();
> > 	}
> 
> I'm fine with either -- a loop is a loop.  Rework it any way which makes
> you more comfortable to work with this code later.
> 
> 
> > >  	device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
> > > -	nbd->pid = 0;
> > 
> > Why is this removed?
> 
> Not sure over a year after the initial patch submission but looking at
> my commit message it must have something to do with its last bit -- the
> non-locked access to nbd->pid?
> 
> 
> > >  	return 0;
> > >  }
> > >  
> > > @@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> > >  		set_capacity(nbd->disk, nbd->bytesize >> 9);
> > >  		return 0;
> > >  
> > > -	case NBD_SET_TIMEOUT:
> > > -		nbd->xmit_timeout = arg * HZ;
> > > +	case NBD_SET_TIMEOUT: {
> > > +		int xt;
> > > +
> > > +		xt = arg * HZ;
> > > +		if (xt < 0)
> > > +			return -EINVAL;
> > > +		if (nbd->pid &&
> > 
> > Could you use nbd->sock here please? This doesn't really depend on a pid
> > does it?
> 
> I remember that when I started looking at this code initially I found
> the whole nbd->pid / nbd->sock kind of redundant... I stuck with using
> nbd->pid but maybe it was the wrong way...

That's fine. I also think they are kind of redundant. I will probably
replace the usage of nbd->pid with nbd->sock so that nbd->pid is just
exported as read-only file to userspace as debugging help.

> 
> 
> > > +		    ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt)))
> > > +			return -EBUSY;
> > > +		dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n",
> > > +			nbd->xmit_timeout / HZ, xt / HZ);
> > 
> > Please use dev_dbg here.
> 
> Feel free to change it to what now is preferred.
> 
> 
> > > +		nbd->xmit_timeout = xt;
> > >  		return 0;
> > > +	}
> > >  
> > >  	case NBD_SET_FLAGS:
> > >  		nbd->flags = arg;
> > > @@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> > >  		if (!nbd->sock)
> > >  			return -EINVAL;
> > >  
> > > +		nbd->pid = task_pid_nr(current);
> > > +		nbd->inflight = 0;
> > > +		nbd->timedout = 0;
> > > +		nbd->sender = NULL;
> > > +		nbd->receiver = NULL;
> > >  		mutex_unlock(&nbd->tx_lock);
> > >  
> > >  		if (nbd->flags & NBD_FLAG_READ_ONLY)
> > > @@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> > >  					nbd->disk->disk_name);
> > >  		if (IS_ERR(thread)) {
> > >  			mutex_lock(&nbd->tx_lock);
> > > +			nbd->pid = 0;
> > 
> > I can't really see why it is necessary to get nbd->pid locked? And if
> > you need this, please split it into a separate patch.
> 
> I think it has to be locked for multiple concurrent nbd-client
> invocations, doesn't it?
> For already established /dev/nbdX you may run things like nbd-client -l
> which end up running exactly the same kernel code and using the same nbd
> structure in the kernel, right?  Or am I being paranoid here?

Yes they are using nbd->pid as well. But there shouldn't be any
concurrent users of this that highly depend on this. But I see why it
may be better to lock this.

Thanks,

Markus Pargmann

> 
> 
> > >  			return PTR_ERR(thread);
> > >  		}
> > >  		wake_up_process(thread);
> > > @@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> > >  		kthread_stop(thread);
> > >  
> > >  		mutex_lock(&nbd->tx_lock);
> > > +		nbd->pid = 0;
> > >  		if (error)
> > >  			return error;
> > >  		sock_shutdown(nbd, 0);
> > > @@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> > >  			sockfd_put(sock);
> > >  		nbd->flags = 0;
> > >  		nbd->bytesize = 0;
> > > +		nbd->xmit_timeout = 0;
> > >  		bdev->bd_inode->i_size = 0;
> > >  		set_capacity(nbd->disk, 0);
> > >  		if (max_part > 0)
> > > @@ -874,6 +960,8 @@ static int __init nbd_init(void)
> > >  		init_waitqueue_head(&nbd_dev[i].waiting_wq);
> > >  		nbd_dev[i].blksize = 1024;
> > >  		nbd_dev[i].bytesize = 0;
> > > +		spin_lock_init(&nbd_dev[i].timer_lock);
> > > +		init_timer(&nbd_dev[i].req_timer);
> > >  		disk->major = NBD_MAJOR;
> > >  		disk->first_minor = i << part_shift;
> > >  		disk->fops = &nbd_fops;
> > > diff --git a/include/linux/nbd.h b/include/linux/nbd.h
> > > index f62f78a..c1280ca 100644
> > > --- a/include/linux/nbd.h
> > > +++ b/include/linux/nbd.h
> > > @@ -41,6 +41,12 @@ struct nbd_device {
> > >  	pid_t pid; /* pid of nbd-client, if attached */
> > >  	int xmit_timeout;
> > >  	int disconnect; /* a disconnect has been requested by user */
> > > +	spinlock_t timer_lock;
> > > +	struct timer_list req_timer;
> > > +	int inflight;
> > > +	int timedout;
> > 
> > Do we really need 'timedout'? or is it possible to use the pending
> > signal to immediately shutdown the nbd?
> 
> Hmm... possibly.
> 
> Kind regards,
> 
> -- 
> Michal Belczyk Sr.
> 

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Attachment: signature.asc
Description: Digital signature
Reply to:
References:
- Re: [Nbd] [PATCH] nbd: improve request timeouts handling
  - From: Markus Pargmann <mpa@...1897...>
- Re: [Nbd] [PATCH] nbd: improve request timeouts handling
  - From: Michal Belczyk <belczyk@...1274...>
Prev by Date: Re: [Nbd] [PATCH] nbd: improve request timeouts handling
Next by Date: Re: [Nbd] nbd 3.9 released
Previous by thread: Re: [Nbd] [PATCH] nbd: improve request timeouts handling
Next by thread: [Nbd] nbd 3.9 released
Index(es):
- Date
- Thread