[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: [PATCH] nbd: fix false lockdep deadlock warning




On 7/5/25 6:45 AM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/07/02 15:30, Yu Kuai 写道:
>> Hi,
>>
>> 在 2025/07/02 14:22, Nilay Shroff 写道:
>>>
>>>
>>> On 7/2/25 8:02 AM, Ming Lei wrote:
>>>> On Wed, Jul 02, 2025 at 09:12:09AM +0800, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> 在 2025/07/01 21:28, Nilay Shroff 写道:
>>>>>>
>>>>>>
>>>>>> On 6/28/25 6:18 AM, Yu Kuai wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> 在 2025/06/27 19:04, Ming Lei 写道:
>>>>>>>> I guess the patch in the following link may be simper, both two take
>>>>>>>> similar approach:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/linux-block/aFjbavzLAFO0Q7n1@fedora/
>>>>>>>
>>>>>>> I this the above approach has concurrent problems if nbd_start_device
>>>>>>> concurrent with nbd_start_device:
>>>>>>>
>>>>>>> t1:
>>>>>>> nbd_start_device
>>>>>>> lock
>>>>>>> num_connections = 1
>>>>>>> unlock
>>>>>>>       t2:
>>>>>>>       nbd_add_socket
>>>>>>>       lock
>>>>>>>       config->num_connections++
>>>>>>>       unlock
>>>>>>>           t3:
>>>>>>>           nbd_start_device
>>>>>>>           lock
>>>>>>>           num_connections = 2
>>>>>>>           unlock
>>>>>>>           blk_mq_update_nr_hw_queues
>>>>>>>
>>>>>>> blk_mq_update_nr_hw_queues
>>>>>>> //nr_hw_queues updated to 1 before failure
>>>>>>> return -EINVAL
>>>>>>>
>>>>>>
>>>>>> In the above case, yes I see that t1 would return -EINVAL (as
>>>>>> config->num_connections doesn't match with num_connections)
>>>>>> but then t3 would succeed to update nr_hw_queue (as both
>>>>>> config->num_connections and num_connections set to 2 this
>>>>>> time). Isn't it? If yes, then the above patch (from Ming)
>>>>>> seems good.
>>>>>
>>>>> Emm, I'm confused, If you agree with the concurrent process, then
>>>>> t3 update nr_hw_queues to 2 first and return sucess, later t1 update
>>>>> nr_hw_queues back to 1 and return failure.
>>>>
>>>> It should be easy to avoid failure by simple retrying.
>>>>
>>> Yeah I think retry should be a safe bet here.
>>>
>>
>> I really not sure about the retry, the above is just a scenario that I
>> think of with a quick review, and there are still many concurrent
>> scenarios that need to be checked, I'm kind of lost here.
>>
>> Except nbd_start_device() and nbd_add_socked(), I'm not confident
>> other context that is synchronized with config_lock is not broken.
>> However, I'm ok with the bet.
>>
>>> On another note, synchronizing nbd_start_device and nbd_add_socket
>>> using nbd->task_setup looks more complex and rather we may use
>>> nbd->pid to synchronize both. We need to move setting of nbd->pid
>>> before we invoke blk_mq_update_nr_hw_queues in nbd_start_device.
>>> Then in nbd_add_socket we can evaluate nbd->pid and if it's
>>> non-NULL then we could assume that either nr_hw_queues update is in
>>> progress or device has been setup and so return -EBUSY. I think
>>> anyways updating number of connections once device is configured
>>> would not be possible, so once nbd_start_device is initiated, we
>>> shall prevent user adding more connections. If we follow this
>>> approach then IMO we don't need to add retry discussed above.
>>
>> It's ok for me to forbit nbd_add_socked after nbd is configured, there
>> is nowhere to use the added sock. And if there really are other contexts
>> need to be synchronized, I think nbd->pid can be used as well.
>>
> 
> Do we have a conclusion now? Feel free to send the retry version, or let
> me know if I should send a new synchronize version.
> 
Personally, I prefer synchronizing nbd_start_device and nbd_add_socket
using nbd->pid but I do agree retry version would also work. Having
said that, lets wait for Ming's feedback as well.

Thanks,
--Nilay


Reply to: