[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: [Nbd] [PATCH] nbd: improve request timeouts handling



Hi Markus,

sorry for the late reply, please find my comments inline...

On Sun, Mar 01, 2015 at 09:39:19PM +0100, Markus Pargmann wrote:
> Hi Michal,
> 
> First of all, thanks for resending this. It works for my test setup but
> I have some comments inline below.
> 
> On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote:
> > Signed-off-by: Michal Belczyk <belczyk@...1274...>
> > ---
> >  drivers/block/nbd.c | 150 +++++++++++++++++++++++++++++++++++++++++-----------
> >  include/linux/nbd.h |   6 +++
> >  2 files changed, 125 insertions(+), 31 deletions(-)
> > 
> > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > index 4bc2a5c..cc4a98a 100644
> > --- a/drivers/block/nbd.c
> > +++ b/drivers/block/nbd.c
> > @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
> >  
> >  static void nbd_xmit_timeout(unsigned long arg)
> >  {
> > -	struct task_struct *task = (struct task_struct *)arg;
> > +	struct nbd_device *nbd = (struct nbd_device *)arg;
> > +	struct task_struct *task_ary[2];
> > +	unsigned long flags;
> > +	int i;
> >  
> > -	printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> > -		task->comm, task->pid);
> > -	force_sig(SIGKILL, task);
> > +	spin_lock_irqsave(&nbd->timer_lock, flags);
> > +	nbd->timedout = 1;
> > +	task_ary[0] = nbd->sender;
> > +	task_ary[1] = nbd->receiver;
> > +	for (i = 0; i < 2; i++) {
> > +		if (task_ary[i] == NULL)
> > +			continue;
> > +		printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> > +			task_ary[i]->comm, task_ary[i]->pid);
> > +		force_sig(SIGKILL, task_ary[i]);
> > +		break;
> > +	}
> > +	spin_unlock_irqrestore(&nbd->timer_lock, flags);
> >  }
> >  
> >  /*
> > @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
> >  	struct msghdr msg;
> >  	struct kvec iov;
> >  	sigset_t blocked, oldset;
> > -	unsigned long pflags = current->flags;
> > +	unsigned long flags, pflags = current->flags;
> 
> I would prefer variable declerations in seperate lines, especially if
> there is an assignment involved, for readability.

Feel free to change it to whatever style you prefer.


> >  	if (unlikely(!sock)) {
> >  		dev_err(disk_to_dev(nbd->disk),
> > @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
> >  		msg.msg_controllen = 0;
> >  		msg.msg_flags = msg_flags | MSG_NOSIGNAL;
> >  
> > -		if (send) {
> > -			struct timer_list ti;
> > -
> > -			if (nbd->xmit_timeout) {
> > -				init_timer(&ti);
> > -				ti.function = nbd_xmit_timeout;
> > -				ti.data = (unsigned long)current;
> > -				ti.expires = jiffies + nbd->xmit_timeout;
> > -				add_timer(&ti);
> > +		if (nbd->xmit_timeout) {
> > +			spin_lock_irqsave(&nbd->timer_lock, flags);
> > +			if (nbd->timedout) {
> > +				spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > +				printk(KERN_WARNING
> > +					"nbd (pid %d: %s) timed out\n",
> > +					task_pid_nr(current), current->comm);
> > +				result = -EINTR;
> > +				sock_shutdown(nbd, !send);
> > +				break;
> 
> I am not a big fan of having this block with a sock_shutdown at two
> points in sock_xmit, here as reaction on nbd->timedout and below as
> handler of the signal received.

Me neither, feel free to change it to whatever you prefer as long as it
is functionally the same thing.
The whole patch was not meant to be pushed as-is but rather as an
initial sketch of code to discuss, so far ignored for over a year...


> >  			}
> > +			if (send)
> > +				nbd->sender = current;
> > +			else
> > +				nbd->receiver = current;
> > +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > +		}
> > +
> 
> What happens if the timer interrupt is happens right at this point?
> The other process may just fall out of the kernel_send/recvmsg() call
> through the signal and call into sock_shutdown(). There it would wait
> for the tx_lock (assuming it is receiving). At the same time this
> process would continue into kernel_sendmsg. There is no active timer
> anymore and the other process is waiting in the lock.
> Is there something I am missing or may this be a problem?

That bit of the code path is somewhat tricky.
Most writes to the socket will be non-blocking as long as the sockbuf
permits.  If I understand correctly you are most concerned of the path
where the SIGKILL(timeout) to the sender thread is queued prior to entering
kernel_sendmsg() which would block it indefinitely and also cause the
receiver process to wait for the tx_lock in its shutdown path, right?
When that happens, the semantics should be similar to a regular write()
on the blocking socket -- return -1 and set errno to EINTR unless
sigaction(SA_RESTART) foo was done, etc.
This actually works almost the same way in our case -- the pending
signal will cause kernel_sendmsg() to eventually return -ERESTARTSYS.
You may actually verify this theory by adding:

    if (send) {
        force_sig(SIGKILL, current);
	result = kernel_sendmsg(sock, &msg, &iov, 1, size);
	printk(KERN_WARNING "result=%d\n", result);
	if (result >= 0)
	    flush_signals(current);
    }

Then if you reduce the default sockbuf size (net.ipv4.tcp_wmem) to, say,
4KB for min/initial/max, a simple dd if=/dev/zero of=/dev/nbdX bs=512K
count=1 oflag=direct should do the trick -- dmesg will reveal what
happened...

Please correct me if I am wrong.


> > +		if (send)
> >  			result = kernel_sendmsg(sock, &msg, &iov, 1, size);
> > -			if (nbd->xmit_timeout)
> > -				del_timer_sync(&ti);
> > -		} else
> > +		else
> >  			result = kernel_recvmsg(sock, &msg, &iov, 1, size,
> >  						msg.msg_flags);
> >  
> > +		if (nbd->xmit_timeout) {
> > +			spin_lock_irqsave(&nbd->timer_lock, flags);
> > +			if (send)
> > +				nbd->sender = NULL;
> > +			else
> > +				nbd->receiver = NULL;
> > +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > +		}
> > +
> >  		if (signal_pending(current)) {
> >  			siginfo_t info;
> >  			printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
> > @@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
> >  }
> >  
> >  static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
> > -		int flags)
> > +		int msg_flags)
> >  {
> >  	int result;
> >  	void *kaddr = kmap(bvec->bv_page);
> >  	result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
> > -			   bvec->bv_len, flags);
> > +			   bvec->bv_len, msg_flags);
> 
> This renaming from flags to msg_flags seems unrelated. I also don't see
> why this is necessary. Please split this into a separate patch if it is
> necessary.

This is unrelated, you may well remove it from the patch.


> >  	kunmap(bvec->bv_page);
> >  	return result;
> >  }
> > @@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
> >  /* always call with the tx_lock held */
> >  static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> >  {
> > -	int result, flags;
> > +	int result, msg_flags;
> 
> Same as above, please rename msg_flags in a separate patch.
> Perhaps 'irq_flags' for the new 'flags' variable would be better?

Whatever you prefer.


> >  	struct nbd_request request;
> > -	unsigned long size = blk_rq_bytes(req);
> > +	unsigned long flags, size = blk_rq_bytes(req);
> >  
> >  	memset(&request, 0, sizeof(request));
> >  	request.magic = htonl(NBD_REQUEST_MAGIC);
> > @@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> >  	}
> >  	memcpy(request.handle, &req, sizeof(req));
> >  
> > +	if (nbd->xmit_timeout) {
> > +		spin_lock_irqsave(&nbd->timer_lock, flags);
> > +		if (!nbd->inflight) {
> > +			nbd->req_timer.function = nbd_xmit_timeout;
> > +			nbd->req_timer.data = (unsigned long)nbd;
> 
> Isn't it possible to assign the above two at initialization?

I think the first two should be ok to move to the init code.


> > +			nbd->req_timer.expires = jiffies + nbd->xmit_timeout;
> > +			add_timer(&nbd->req_timer);
> > +		}
> > +		nbd->inflight++;
> > +		BUG_ON(nbd->inflight <= 0);
> > +		spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > +	}
> > +
> >  	dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n",
> >  			nbd->disk->disk_name, req,
> >  			nbdcmd_to_ascii(nbd_cmd(req)),
> > @@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> >  		 * whether to set MSG_MORE or not...
> >  		 */
> >  		rq_for_each_segment(bvec, req, iter) {
> > -			flags = 0;
> > +			msg_flags = 0;
> >  			if (!rq_iter_last(bvec, iter))
> > -				flags = MSG_MORE;
> > +				msg_flags = MSG_MORE;
> >  			dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n",
> >  					nbd->disk->disk_name, req, bvec.bv_len);
> > -			result = sock_send_bvec(nbd, &bvec, flags);
> > +			result = sock_send_bvec(nbd, &bvec, msg_flags);
> >  			if (result <= 0) {
> >  				dev_err(disk_to_dev(nbd->disk),
> >  					"Send data failed (result %d)\n",
> > @@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> >  	return 0;
> >  
> >  error_out:
> > +	if (nbd->xmit_timeout) {
> > +		spin_lock_irqsave(&nbd->timer_lock, flags);
> > +		nbd->inflight--;
> > +		BUG_ON(nbd->inflight < 0);
> 
> nbd->inflight was checked above already with BUG_ON().

The BUG_ON()'s may well be removed -- this is a leftover from my initial
testing of this patch to make sure the counters do not overflow or go
negative -- do whatever you want with it (I'd leave them as they are,
but it's up to you).


> > +		if (!nbd->inflight)
> > +			del_timer_sync(&nbd->req_timer);
> > +		spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > +	}
> >  	return -EIO;
> >  }
> >  
> > @@ -412,24 +462,41 @@ static struct device_attribute pid_attr = {
> >  static int nbd_do_it(struct nbd_device *nbd)
> >  {
> >  	struct request *req;
> > +	unsigned long flags;
> >  	int ret;
> >  
> >  	BUG_ON(nbd->magic != NBD_MAGIC);
> >  
> >  	sk_set_memalloc(nbd->sock->sk);
> > -	nbd->pid = task_pid_nr(current);
> >  	ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
> >  	if (ret) {
> >  		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
> > -		nbd->pid = 0;
> >  		return ret;
> >  	}
> >  
> > -	while ((req = nbd_read_stat(nbd)) != NULL)
> > -		nbd_end_request(req);
> > +	for (;;) {
> > +		req = nbd_read_stat(nbd);
> > +		if (nbd->xmit_timeout) {
> > +			spin_lock_irqsave(&nbd->timer_lock, flags);
> > +			if (req != NULL) {
> > +				nbd->inflight--;
> > +				BUG_ON(nbd->inflight < 0);
> > +			}
> > +			if (req != NULL && nbd->inflight)
> > +				mod_timer(&nbd->req_timer,
> > +					  jiffies + nbd->xmit_timeout);
> > +			else
> > +				del_timer_sync(&nbd->req_timer);
> > +			spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > +		}
> > +		if (req != NULL) {
> > +			nbd_end_request(req);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> 
> Please make this loop a bit more readable. Perhaps something like this?
> 
> 	while ((req = nbd_read_stat(nbd)) != NULL) {
> 		nbd_end_request(req);
> 		if (nbd->xmit_timeout) {
> 			...
> 		}
> 	}
> 	if (nbd->xmit_timeout) {
> 		lock();
> 		del_timer_sync();
> 		unlock();
> 	}

I'm fine with either -- a loop is a loop.  Rework it any way which makes
you more comfortable to work with this code later.


> >  	device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
> > -	nbd->pid = 0;
> 
> Why is this removed?

Not sure over a year after the initial patch submission but looking at
my commit message it must have something to do with its last bit -- the
non-locked access to nbd->pid?


> >  	return 0;
> >  }
> >  
> > @@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> >  		set_capacity(nbd->disk, nbd->bytesize >> 9);
> >  		return 0;
> >  
> > -	case NBD_SET_TIMEOUT:
> > -		nbd->xmit_timeout = arg * HZ;
> > +	case NBD_SET_TIMEOUT: {
> > +		int xt;
> > +
> > +		xt = arg * HZ;
> > +		if (xt < 0)
> > +			return -EINVAL;
> > +		if (nbd->pid &&
> 
> Could you use nbd->sock here please? This doesn't really depend on a pid
> does it?

I remember that when I started looking at this code initially I found
the whole nbd->pid / nbd->sock kind of redundant... I stuck with using
nbd->pid but maybe it was the wrong way...


> > +		    ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt)))
> > +			return -EBUSY;
> > +		dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n",
> > +			nbd->xmit_timeout / HZ, xt / HZ);
> 
> Please use dev_dbg here.

Feel free to change it to what now is preferred.


> > +		nbd->xmit_timeout = xt;
> >  		return 0;
> > +	}
> >  
> >  	case NBD_SET_FLAGS:
> >  		nbd->flags = arg;
> > @@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> >  		if (!nbd->sock)
> >  			return -EINVAL;
> >  
> > +		nbd->pid = task_pid_nr(current);
> > +		nbd->inflight = 0;
> > +		nbd->timedout = 0;
> > +		nbd->sender = NULL;
> > +		nbd->receiver = NULL;
> >  		mutex_unlock(&nbd->tx_lock);
> >  
> >  		if (nbd->flags & NBD_FLAG_READ_ONLY)
> > @@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> >  					nbd->disk->disk_name);
> >  		if (IS_ERR(thread)) {
> >  			mutex_lock(&nbd->tx_lock);
> > +			nbd->pid = 0;
> 
> I can't really see why it is necessary to get nbd->pid locked? And if
> you need this, please split it into a separate patch.

I think it has to be locked for multiple concurrent nbd-client
invocations, doesn't it?
For already established /dev/nbdX you may run things like nbd-client -l
which end up running exactly the same kernel code and using the same nbd
structure in the kernel, right?  Or am I being paranoid here?


> >  			return PTR_ERR(thread);
> >  		}
> >  		wake_up_process(thread);
> > @@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> >  		kthread_stop(thread);
> >  
> >  		mutex_lock(&nbd->tx_lock);
> > +		nbd->pid = 0;
> >  		if (error)
> >  			return error;
> >  		sock_shutdown(nbd, 0);
> > @@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
> >  			sockfd_put(sock);
> >  		nbd->flags = 0;
> >  		nbd->bytesize = 0;
> > +		nbd->xmit_timeout = 0;
> >  		bdev->bd_inode->i_size = 0;
> >  		set_capacity(nbd->disk, 0);
> >  		if (max_part > 0)
> > @@ -874,6 +960,8 @@ static int __init nbd_init(void)
> >  		init_waitqueue_head(&nbd_dev[i].waiting_wq);
> >  		nbd_dev[i].blksize = 1024;
> >  		nbd_dev[i].bytesize = 0;
> > +		spin_lock_init(&nbd_dev[i].timer_lock);
> > +		init_timer(&nbd_dev[i].req_timer);
> >  		disk->major = NBD_MAJOR;
> >  		disk->first_minor = i << part_shift;
> >  		disk->fops = &nbd_fops;
> > diff --git a/include/linux/nbd.h b/include/linux/nbd.h
> > index f62f78a..c1280ca 100644
> > --- a/include/linux/nbd.h
> > +++ b/include/linux/nbd.h
> > @@ -41,6 +41,12 @@ struct nbd_device {
> >  	pid_t pid; /* pid of nbd-client, if attached */
> >  	int xmit_timeout;
> >  	int disconnect; /* a disconnect has been requested by user */
> > +	spinlock_t timer_lock;
> > +	struct timer_list req_timer;
> > +	int inflight;
> > +	int timedout;
> 
> Do we really need 'timedout'? or is it possible to use the pending
> signal to immediately shutdown the nbd?

Hmm... possibly.

Kind regards,

-- 
Michal Belczyk Sr.



Reply to: