Hi Michal, On Tue, Mar 03, 2015 at 11:26:39PM +0100, Michal Belczyk wrote: > Hi Markus, > > sorry for the late reply, please find my comments inline... not late at all. > > On Sun, Mar 01, 2015 at 09:39:19PM +0100, Markus Pargmann wrote: > > Hi Michal, > > > > First of all, thanks for resending this. It works for my test setup but > > I have some comments inline below. > > > > On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote: > > > Signed-off-by: Michal Belczyk <belczyk@...1274...> > > > --- > > > drivers/block/nbd.c | 150 +++++++++++++++++++++++++++++++++++++++++----------- > > > include/linux/nbd.h | 6 +++ > > > 2 files changed, 125 insertions(+), 31 deletions(-) > > > > > > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c > > > index 4bc2a5c..cc4a98a 100644 > > > --- a/drivers/block/nbd.c > > > +++ b/drivers/block/nbd.c > > > @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock) > > > > > > static void nbd_xmit_timeout(unsigned long arg) > > > { > > > - struct task_struct *task = (struct task_struct *)arg; > > > + struct nbd_device *nbd = (struct nbd_device *)arg; > > > + struct task_struct *task_ary[2]; > > > + unsigned long flags; > > > + int i; > > > > > > - printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n", > > > - task->comm, task->pid); > > > - force_sig(SIGKILL, task); > > > + spin_lock_irqsave(&nbd->timer_lock, flags); > > > + nbd->timedout = 1; > > > + task_ary[0] = nbd->sender; > > > + task_ary[1] = nbd->receiver; > > > + for (i = 0; i < 2; i++) { > > > + if (task_ary[i] == NULL) > > > + continue; > > > + printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n", > > > + task_ary[i]->comm, task_ary[i]->pid); > > > + force_sig(SIGKILL, task_ary[i]); > > > + break; > > > + } > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > } > > > > > > /* > > > @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size, > > > struct msghdr msg; > > > struct kvec iov; > > > sigset_t blocked, oldset; > > > - unsigned long pflags = current->flags; > > > + unsigned long flags, pflags = current->flags; > > > > I would prefer variable declerations in seperate lines, especially if > > there is an assignment involved, for readability. > > Feel free to change it to whatever style you prefer. > > > > > if (unlikely(!sock)) { > > > dev_err(disk_to_dev(nbd->disk), > > > @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size, > > > msg.msg_controllen = 0; > > > msg.msg_flags = msg_flags | MSG_NOSIGNAL; > > > > > > - if (send) { > > > - struct timer_list ti; > > > - > > > - if (nbd->xmit_timeout) { > > > - init_timer(&ti); > > > - ti.function = nbd_xmit_timeout; > > > - ti.data = (unsigned long)current; > > > - ti.expires = jiffies + nbd->xmit_timeout; > > > - add_timer(&ti); > > > + if (nbd->xmit_timeout) { > > > + spin_lock_irqsave(&nbd->timer_lock, flags); > > > + if (nbd->timedout) { > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > + printk(KERN_WARNING > > > + "nbd (pid %d: %s) timed out\n", > > > + task_pid_nr(current), current->comm); > > > + result = -EINTR; > > > + sock_shutdown(nbd, !send); > > > + break; > > > > I am not a big fan of having this block with a sock_shutdown at two > > points in sock_xmit, here as reaction on nbd->timedout and below as > > handler of the signal received. > > Me neither, feel free to change it to whatever you prefer as long as it > is functionally the same thing. > The whole patch was not meant to be pushed as-is but rather as an > initial sketch of code to discuss, so far ignored for over a year... Yes, sorry for that. > > > > > } > > > + if (send) > > > + nbd->sender = current; > > > + else > > > + nbd->receiver = current; > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > + } > > > + > > > > What happens if the timer interrupt is happens right at this point? > > The other process may just fall out of the kernel_send/recvmsg() call > > through the signal and call into sock_shutdown(). There it would wait > > for the tx_lock (assuming it is receiving). At the same time this > > process would continue into kernel_sendmsg. There is no active timer > > anymore and the other process is waiting in the lock. > > Is there something I am missing or may this be a problem? > > That bit of the code path is somewhat tricky. > Most writes to the socket will be non-blocking as long as the sockbuf > permits. If I understand correctly you are most concerned of the path > where the SIGKILL(timeout) to the sender thread is queued prior to entering > kernel_sendmsg() which would block it indefinitely and also cause the > receiver process to wait for the tx_lock in its shutdown path, right? > When that happens, the semantics should be similar to a regular write() > on the blocking socket -- return -1 and set errno to EINTR unless > sigaction(SA_RESTART) foo was done, etc. > This actually works almost the same way in our case -- the pending > signal will cause kernel_sendmsg() to eventually return -ERESTARTSYS. > You may actually verify this theory by adding: > > if (send) { > force_sig(SIGKILL, current); > result = kernel_sendmsg(sock, &msg, &iov, 1, size); > printk(KERN_WARNING "result=%d\n", result); > if (result >= 0) > flush_signals(current); > } > > Then if you reduce the default sockbuf size (net.ipv4.tcp_wmem) to, say, > 4KB for min/initial/max, a simple dd if=/dev/zero of=/dev/nbdX bs=512K > count=1 oflag=direct should do the trick -- dmesg will reveal what > happened... > > Please correct me if I am wrong. Thanks for the explanation, I forgot that there is a signal pending for the second process as well. I will test this to be sure but it should work then. > > > > > + if (send) > > > result = kernel_sendmsg(sock, &msg, &iov, 1, size); > > > - if (nbd->xmit_timeout) > > > - del_timer_sync(&ti); > > > - } else > > > + else > > > result = kernel_recvmsg(sock, &msg, &iov, 1, size, > > > msg.msg_flags); > > > > > > + if (nbd->xmit_timeout) { > > > + spin_lock_irqsave(&nbd->timer_lock, flags); > > > + if (send) > > > + nbd->sender = NULL; > > > + else > > > + nbd->receiver = NULL; > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > + } > > > + > > > if (signal_pending(current)) { > > > siginfo_t info; > > > printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n", > > > @@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size, > > > } > > > > > > static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec, > > > - int flags) > > > + int msg_flags) > > > { > > > int result; > > > void *kaddr = kmap(bvec->bv_page); > > > result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset, > > > - bvec->bv_len, flags); > > > + bvec->bv_len, msg_flags); > > > > This renaming from flags to msg_flags seems unrelated. I also don't see > > why this is necessary. Please split this into a separate patch if it is > > necessary. > > This is unrelated, you may well remove it from the patch. > > > > > kunmap(bvec->bv_page); > > > return result; > > > } > > > @@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec, > > > /* always call with the tx_lock held */ > > > static int nbd_send_req(struct nbd_device *nbd, struct request *req) > > > { > > > - int result, flags; > > > + int result, msg_flags; > > > > Same as above, please rename msg_flags in a separate patch. > > Perhaps 'irq_flags' for the new 'flags' variable would be better? > > Whatever you prefer. > > > > > struct nbd_request request; > > > - unsigned long size = blk_rq_bytes(req); > > > + unsigned long flags, size = blk_rq_bytes(req); > > > > > > memset(&request, 0, sizeof(request)); > > > request.magic = htonl(NBD_REQUEST_MAGIC); > > > @@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req) > > > } > > > memcpy(request.handle, &req, sizeof(req)); > > > > > > + if (nbd->xmit_timeout) { > > > + spin_lock_irqsave(&nbd->timer_lock, flags); > > > + if (!nbd->inflight) { > > > + nbd->req_timer.function = nbd_xmit_timeout; > > > + nbd->req_timer.data = (unsigned long)nbd; > > > > Isn't it possible to assign the above two at initialization? > > I think the first two should be ok to move to the init code. > > > > > + nbd->req_timer.expires = jiffies + nbd->xmit_timeout; > > > + add_timer(&nbd->req_timer); > > > + } > > > + nbd->inflight++; > > > + BUG_ON(nbd->inflight <= 0); > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > + } > > > + > > > dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n", > > > nbd->disk->disk_name, req, > > > nbdcmd_to_ascii(nbd_cmd(req)), > > > @@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req) > > > * whether to set MSG_MORE or not... > > > */ > > > rq_for_each_segment(bvec, req, iter) { > > > - flags = 0; > > > + msg_flags = 0; > > > if (!rq_iter_last(bvec, iter)) > > > - flags = MSG_MORE; > > > + msg_flags = MSG_MORE; > > > dprintk(DBG_TX, "%s: request %p: sending %d bytes data\n", > > > nbd->disk->disk_name, req, bvec.bv_len); > > > - result = sock_send_bvec(nbd, &bvec, flags); > > > + result = sock_send_bvec(nbd, &bvec, msg_flags); > > > if (result <= 0) { > > > dev_err(disk_to_dev(nbd->disk), > > > "Send data failed (result %d)\n", > > > @@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, struct request *req) > > > return 0; > > > > > > error_out: > > > + if (nbd->xmit_timeout) { > > > + spin_lock_irqsave(&nbd->timer_lock, flags); > > > + nbd->inflight--; > > > + BUG_ON(nbd->inflight < 0); > > > > nbd->inflight was checked above already with BUG_ON(). > > The BUG_ON()'s may well be removed -- this is a leftover from my initial > testing of this patch to make sure the counters do not overflow or go > negative -- do whatever you want with it (I'd leave them as they are, > but it's up to you). > > > > > + if (!nbd->inflight) > > > + del_timer_sync(&nbd->req_timer); > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > + } > > > return -EIO; > > > } > > > > > > @@ -412,24 +462,41 @@ static struct device_attribute pid_attr = { > > > static int nbd_do_it(struct nbd_device *nbd) > > > { > > > struct request *req; > > > + unsigned long flags; > > > int ret; > > > > > > BUG_ON(nbd->magic != NBD_MAGIC); > > > > > > sk_set_memalloc(nbd->sock->sk); > > > - nbd->pid = task_pid_nr(current); > > > ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr); > > > if (ret) { > > > dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n"); > > > - nbd->pid = 0; > > > return ret; > > > } > > > > > > - while ((req = nbd_read_stat(nbd)) != NULL) > > > - nbd_end_request(req); > > > + for (;;) { > > > + req = nbd_read_stat(nbd); > > > + if (nbd->xmit_timeout) { > > > + spin_lock_irqsave(&nbd->timer_lock, flags); > > > + if (req != NULL) { > > > + nbd->inflight--; > > > + BUG_ON(nbd->inflight < 0); > > > + } > > > + if (req != NULL && nbd->inflight) > > > + mod_timer(&nbd->req_timer, > > > + jiffies + nbd->xmit_timeout); > > > + else > > > + del_timer_sync(&nbd->req_timer); > > > + spin_unlock_irqrestore(&nbd->timer_lock, flags); > > > + } > > > + if (req != NULL) { > > > + nbd_end_request(req); > > > + continue; > > > + } > > > + break; > > > + } > > > > Please make this loop a bit more readable. Perhaps something like this? > > > > while ((req = nbd_read_stat(nbd)) != NULL) { > > nbd_end_request(req); > > if (nbd->xmit_timeout) { > > ... > > } > > } > > if (nbd->xmit_timeout) { > > lock(); > > del_timer_sync(); > > unlock(); > > } > > I'm fine with either -- a loop is a loop. Rework it any way which makes > you more comfortable to work with this code later. > > > > > device_remove_file(disk_to_dev(nbd->disk), &pid_attr); > > > - nbd->pid = 0; > > > > Why is this removed? > > Not sure over a year after the initial patch submission but looking at > my commit message it must have something to do with its last bit -- the > non-locked access to nbd->pid? > > > > > return 0; > > > } > > > > > > @@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, > > > set_capacity(nbd->disk, nbd->bytesize >> 9); > > > return 0; > > > > > > - case NBD_SET_TIMEOUT: > > > - nbd->xmit_timeout = arg * HZ; > > > + case NBD_SET_TIMEOUT: { > > > + int xt; > > > + > > > + xt = arg * HZ; > > > + if (xt < 0) > > > + return -EINVAL; > > > + if (nbd->pid && > > > > Could you use nbd->sock here please? This doesn't really depend on a pid > > does it? > > I remember that when I started looking at this code initially I found > the whole nbd->pid / nbd->sock kind of redundant... I stuck with using > nbd->pid but maybe it was the wrong way... That's fine. I also think they are kind of redundant. I will probably replace the usage of nbd->pid with nbd->sock so that nbd->pid is just exported as read-only file to userspace as debugging help. > > > > > + ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt))) > > > + return -EBUSY; > > > + dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n", > > > + nbd->xmit_timeout / HZ, xt / HZ); > > > > Please use dev_dbg here. > > Feel free to change it to what now is preferred. > > > > > + nbd->xmit_timeout = xt; > > > return 0; > > > + } > > > > > > case NBD_SET_FLAGS: > > > nbd->flags = arg; > > > @@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, > > > if (!nbd->sock) > > > return -EINVAL; > > > > > > + nbd->pid = task_pid_nr(current); > > > + nbd->inflight = 0; > > > + nbd->timedout = 0; > > > + nbd->sender = NULL; > > > + nbd->receiver = NULL; > > > mutex_unlock(&nbd->tx_lock); > > > > > > if (nbd->flags & NBD_FLAG_READ_ONLY) > > > @@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, > > > nbd->disk->disk_name); > > > if (IS_ERR(thread)) { > > > mutex_lock(&nbd->tx_lock); > > > + nbd->pid = 0; > > > > I can't really see why it is necessary to get nbd->pid locked? And if > > you need this, please split it into a separate patch. > > I think it has to be locked for multiple concurrent nbd-client > invocations, doesn't it? > For already established /dev/nbdX you may run things like nbd-client -l > which end up running exactly the same kernel code and using the same nbd > structure in the kernel, right? Or am I being paranoid here? Yes they are using nbd->pid as well. But there shouldn't be any concurrent users of this that highly depend on this. But I see why it may be better to lock this. Thanks, Markus Pargmann > > > > > return PTR_ERR(thread); > > > } > > > wake_up_process(thread); > > > @@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, > > > kthread_stop(thread); > > > > > > mutex_lock(&nbd->tx_lock); > > > + nbd->pid = 0; > > > if (error) > > > return error; > > > sock_shutdown(nbd, 0); > > > @@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, > > > sockfd_put(sock); > > > nbd->flags = 0; > > > nbd->bytesize = 0; > > > + nbd->xmit_timeout = 0; > > > bdev->bd_inode->i_size = 0; > > > set_capacity(nbd->disk, 0); > > > if (max_part > 0) > > > @@ -874,6 +960,8 @@ static int __init nbd_init(void) > > > init_waitqueue_head(&nbd_dev[i].waiting_wq); > > > nbd_dev[i].blksize = 1024; > > > nbd_dev[i].bytesize = 0; > > > + spin_lock_init(&nbd_dev[i].timer_lock); > > > + init_timer(&nbd_dev[i].req_timer); > > > disk->major = NBD_MAJOR; > > > disk->first_minor = i << part_shift; > > > disk->fops = &nbd_fops; > > > diff --git a/include/linux/nbd.h b/include/linux/nbd.h > > > index f62f78a..c1280ca 100644 > > > --- a/include/linux/nbd.h > > > +++ b/include/linux/nbd.h > > > @@ -41,6 +41,12 @@ struct nbd_device { > > > pid_t pid; /* pid of nbd-client, if attached */ > > > int xmit_timeout; > > > int disconnect; /* a disconnect has been requested by user */ > > > + spinlock_t timer_lock; > > > + struct timer_list req_timer; > > > + int inflight; > > > + int timedout; > > > > Do we really need 'timedout'? or is it possible to use the pending > > signal to immediately shutdown the nbd? > > Hmm... possibly. > > Kind regards, > > -- > Michal Belczyk Sr. > -- Pengutronix e.K. | | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Attachment:
signature.asc
Description: Digital signature