Re: [PATCH V2] nbd: fix partial sending
Am 18.10.2024 um 16:08 hat Ming Lei geschrieben:
> nbd driver sends request header and payload with multiple call of
> sock_sendmsg, and partial sending can't be avoided. However, nbd driver
> returns BLK_STS_RESOURCE to block core in this situation. This way causes
> one issue: request->tag may change in the next run of nbd_queue_rq(), but
> the original old tag has been sent as part of header cookie, this way
> confuses nbd driver reply handling, since the real request can't be
> retrieved any more with the obsolete old tag.
>
> Fix it by retrying sending directly in per-socket work function,
> meantime return BLK_STS_OK to block layer core.
>
> Cc: vincent.chen@sifive.com
> Cc: Leon Schuermann <leon@is.currently.online>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Reported-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V2:
> - move pending retry to socket work function and return BLK_STS_OK, so that
> userspace can get chance to handle the signal(Kevin)
So first of all: This seems to work.
But looking through the code I have a few comments:
> drivers/block/nbd.c | 89 +++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 77 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index b852050d8a96..855f4a79e37c 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -62,6 +62,7 @@ struct nbd_sock {
> bool dead;
> int fallback_index;
> int cookie;
> + struct work_struct work;
> };
>
> struct recv_thread_args {
> @@ -141,6 +142,9 @@ struct nbd_device {
> */
> #define NBD_CMD_INFLIGHT 2
>
> +/* Just part of request header or data payload is sent successfully */
> +#define NBD_CMD_PARTIAL_SEND 3
> +
> struct nbd_cmd {
> struct nbd_device *nbd;
> struct mutex lock;
> @@ -466,6 +470,12 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req)
> if (!mutex_trylock(&cmd->lock))
> return BLK_EH_RESET_TIMER;
>
> + /* partial send is handled in nbd_sock's work function */
> + if (test_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags)) {
> + mutex_unlock(&cmd->lock);
> + return BLK_EH_RESET_TIMER;
> + }
> +
> if (!test_bit(NBD_CMD_INFLIGHT, &cmd->flags)) {
> mutex_unlock(&cmd->lock);
> return BLK_EH_DONE;
> @@ -614,6 +624,27 @@ static inline int was_interrupted(int result)
> return result == -ERESTARTSYS || result == -EINTR;
> }
>
> +/*
> + * We've already sent header or part of data payload, have no choice but
> + * to set pending and schedule it in work.
> + *
> + * And we have to return BLK_STS_OK to block core, otherwise this same
> + * request may be re-dispatched with different tag, but our header has
> + * been sent out with old tag, and this way does confuse reply handling.
> + */
> +static void nbd_run_pending_work(struct nbd_device *nbd,
> + struct nbd_sock *nsock,
> + struct nbd_cmd *cmd, int sent)
The name of this function is a bit confusing, we don't actually run
anything here. Maybe nbd_schedule_pending_work()? Or something else with
"schedule" and "send".
> +{
> + struct request *req = blk_mq_rq_from_pdu(cmd);
> +
> + nsock->pending = req;
> + nsock->sent = sent;
> + set_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags);
> + refcount_inc(&nbd->config_refs);
> + schedule_work(&nsock->work);
> +}
Important point about this function: It doesn't check if we already have
scheduled the work. You seem to have some code in nbd_pending_cmd_work()
that can cope with being scheduled multiple times, but allowing the
situation makes the control flow hard to follow. It would probably be
better to avoid this and call refcount_inc() and schedule_work() only if
NBD_CMD_PARTIAL_SEND isn't already set.
> +
> /*
> * Returns BLK_STS_RESOURCE if the caller should retry after a delay.
> * Returns BLK_STS_IOERR if sending failed.
> @@ -699,11 +730,12 @@ static blk_status_t nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd,
> * completely done.
> */
> if (sent) {
> - nsock->pending = req;
> - nsock->sent = sent;
> + nbd_run_pending_work(nbd, nsock, cmd, sent);
> + return BLK_STS_OK;
Now the question is if requeuing is already implicitly avoided, because
returning EINTR to a worker doesn't seem to make a lot of sense to me,
and so we might actually never hit this code path from a worker. But I'm
not completely sure. Maybe better to detect the situation in
nbd_run_pending_work() anyway.
> + } else {
> + set_bit(NBD_CMD_REQUEUED, &cmd->flags);
> + return BLK_STS_RESOURCE;
> }
> - set_bit(NBD_CMD_REQUEUED, &cmd->flags);
> - return BLK_STS_RESOURCE;
This is an unrelated style change.
> }
> dev_err_ratelimited(disk_to_dev(nbd->disk),
> "Send control failed (result %d)\n", result);
> @@ -737,14 +769,8 @@ static blk_status_t nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd,
> result = sock_xmit(nbd, index, 1, &from, flags, &sent);
> if (result < 0) {
> if (was_interrupted(result)) {
> - /* We've already sent the header, we
> - * have no choice but to set pending and
> - * return BUSY.
> - */
> - nsock->pending = req;
> - nsock->sent = sent;
> - set_bit(NBD_CMD_REQUEUED, &cmd->flags);
> - return BLK_STS_RESOURCE;
> + nbd_run_pending_work(nbd, nsock, cmd, sent);
> + return BLK_STS_OK;
> }
> dev_err(disk_to_dev(nbd->disk),
> "Send data failed (result %d)\n",
> @@ -778,6 +804,44 @@ static blk_status_t nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd,
> return BLK_STS_OK;
> }
>
> +/* handle partial sending */
> +static void nbd_pending_cmd_work(struct work_struct *work)
> +{
> + struct nbd_sock *nsock = container_of(work, struct nbd_sock, work);
> + struct request *req = nsock->pending;
> + struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
> + struct nbd_device *nbd = cmd->nbd;
> + unsigned long deadline = READ_ONCE(req->deadline);
> + unsigned int wait_ms = 2;
> +
> + mutex_lock(&cmd->lock);
> +
> + WARN_ON_ONCE(test_bit(NBD_CMD_REQUEUED, &cmd->flags));
> + if (!test_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags))
> + goto out;
I believe this check is only for detecting if the work was scheduled
multiple times? If we avoid that above, we probably can make this a
WARN_ON_ONCE(), too.
> +
> + mutex_lock(&nsock->tx_lock);
> + while (true) {
> + nbd_send_cmd(nbd, cmd, cmd->index);
> + if (!nsock->pending)
> + break;
If it's true that we can never get EINTR or ERESTARTSYS in a worker,
then this condition would always be true and the loop and sleeping logic
is unnecessary.
If it actually is necessary, I wonder if it wouldn't be better to just
let nbd_send_cmd() reschedule the work without a loop and sleeping here.
I'm not entirely sure what EINTR/ERESTARTSYS should even mean in this
context, but like in the .queue_rq() path, the thing that clears these
error conditions would probably be returning, not sleeping?
> +
> + /* don't bother timeout handler for partial sending */
> + if (READ_ONCE(jiffies) + msecs_to_jiffies(wait_ms) >= deadline) {
> + cmd->status = BLK_STS_IOERR;
> + blk_mq_complete_request(req);
> + break;
> + }
> + msleep(wait_ms);
> + wait_ms *= 2;
> + }
> + mutex_unlock(&nsock->tx_lock);
> + clear_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags);
> +out:
> + mutex_unlock(&cmd->lock);
> + nbd_config_put(nbd);
> +}
Kevin
Reply to: