
Re: [PATCH v2 3/3] nbd: fix race between nbd_alloc_config() and module removal



Hi,

On 9/6/2021 6:25 PM, Christoph Hellwig wrote:
> On Mon, Sep 06, 2021 at 06:08:54PM +0800, Hou Tao wrote:
>>>> +	if (!try_module_get(THIS_MODULE))
>>>> +		return ERR_PTR(-ENODEV);
>>> try_module_get(THIS_MODULE) is an indicator for an unsafe pattern.  If
>>> we don't already have a reference it could never close the race.
>>>
>>> Looking at the callers:
>>>
>>>  - nbd_open like all block device operations must have a reference
>>>    already.
>> Yes. nbd_open() has already taken a reference in dentry_open().
>>>  - for nbd_genl_connect I'm not an expert, but given that struct
>>>    nbd_genl_family has a module member I suspect the networking
>>>    code already takes a reference.
>> That was my original thought, but the fact is that the netlink code doesn't
>> take a module reference in genl_family_rcv_msg_doit(). Netlink uses
>> genl_lock_all() to serialize module removal against calls into
>> nbd_connect_genl_ops, so I think using try_module_get() is OK here.
> How is this going to work?  If there was a race you just shortened it,
> but it can still happen before you call try_module_get.  So I think we
> need to look into how the netlink calling conventions are supposed to
> look and understand the issues there first.

Let me explain first. The reason it works is genl_lock_all() in the netlink code.
If the module removal happens before try_module_get() is called, nbd_cleanup()
calls genl_unregister_family(), which in turn takes genl_lock_all().
genl_lock_all() prevents the ops in nbd_connect_genl_ops from being called,
because nbd ops are invoked from genl_rcv(), which needs to acquire cb_lock first.


process A                                       process B
module removal
genl_unregister_family()
  genl_lock_all()
    down_write(&cb_lock)
                                                receive a new netlink message
                                                genl_rcv()
                                                   // will wait for the removal of nbd ops
                                                   down_read(&cb_lock)

If nbd_alloc_config() happens before the module removal, genl_rcv() must
already have acquired cb_lock and genl_mutex, so nbd_cleanup() will block in
genl_unregister_family(). When nbd_alloc_config() then calls try_module_get(),
it finds that the module is dying, so nbd_genl_connect() fails.


process A                                     process B
a new netlink message
genl_rcv()
  down_read(&cb_lock)
    mutex_lock(&genl_mutex)
      nbd_genl_connect()
        nbd_alloc_config()
                                               module removal
                                               genl_unregister_family()
          // module is dying, so fail
          try_module_get()
                                                 genl_lock_all()
                                                   // wait for the completion of nbd ops
                                                   down_write(&cb_lock)
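
The two orderings above can be walked through in a small single-threaded
userspace model. This is only a sketch, not kernel code: every name in it
(struct model, model_down_read(), removal_first(), ...) is hypothetical, the
rwsem is reduced to a "would block" check, genl_mutex is left out, and the
-ENOENT returned once the family is unregistered is an assumption of the
model rather than something established in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum mod_state { MOD_LIVE, MOD_GOING };

struct model {
	enum mod_state state;	/* MOD_GOING once removal has started */
	int refcnt;		/* module references handed out */
	int readers;		/* cb_lock read holders (genl_rcv) */
	bool writer;		/* cb_lock write holder (genl_lock_all) */
	bool registered;	/* nbd genl family still registered */
};

/* Non-blocking stand-ins for the rwsem: false means the real call would sleep. */
static bool model_down_read(struct model *m)
{
	if (m->writer)
		return false;
	m->readers++;
	return true;
}

static bool model_down_write(struct model *m)
{
	if (m->writer || m->readers > 0)
		return false;
	m->writer = true;
	return true;
}

static void model_up_read(struct model *m)  { m->readers--; }
static void model_up_write(struct model *m) { m->writer = false; }

/* Like try_module_get(): refuses once the module is no longer live. */
static bool model_try_module_get(struct model *m)
{
	if (m->state != MOD_LIVE)
		return false;
	m->refcnt++;
	return true;
}

/* Removal can only start while nobody holds a module reference. */
static bool model_begin_removal(struct model *m)
{
	if (m->refcnt > 0)
		return false;
	m->state = MOD_GOING;
	return true;
}

/* Body of the connect op; the caller must hold cb_lock for read. */
static int model_connect(struct model *m)
{
	assert(m->readers > 0);
	if (!m->registered)
		return -ENOENT;	/* family already gone, the op never runs */
	if (!model_try_module_get(m))
		return -ENODEV;	/* module is dying */
	return 0;
}

/* First diagram: removal takes cb_lock for write before the message arrives. */
static int removal_first(void)
{
	struct model m = { MOD_LIVE, 0, 0, false, true };
	int ret;

	if (!model_begin_removal(&m) || !model_down_write(&m))
		return 1;		/* unexpected in this model */
	if (model_down_read(&m))
		return 1;		/* genl_rcv() should have had to wait */
	m.registered = false;		/* family is unregistered under the lock */
	model_up_write(&m);
	if (!model_down_read(&m))	/* genl_rcv() resumes */
		return 1;
	ret = model_connect(&m);
	model_up_read(&m);
	return ret;
}

/* Second diagram: genl_rcv() already holds cb_lock when removal starts. */
static int connect_first(void)
{
	struct model m = { MOD_LIVE, 0, 0, false, true };
	int ret;

	if (!model_down_read(&m) || !model_begin_removal(&m))
		return 1;		/* unexpected in this model */
	if (model_down_write(&m))
		return 1;		/* genl_lock_all() should have had to wait */
	ret = model_connect(&m);	/* try_module_get() sees a dying module */
	model_up_read(&m);
	return ret;
}
```

In this model removal_first() never reaches the connect body while the
family is registered, and connect_first() fails in the try_module_get()
step, which are the two outcomes the diagrams describe.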

I have checked several genl_ops implementations. It seems that they don't need
try_module_get() because they don't create new objects through genl_ops; they
only control existing ones. genl_family_rcv_msg_dumpit() does try to take a
module reference with try_module_get(), but according to the history
(6dc878a8ca39 "netlink: add reference of module in netlink_dump_start"),
that is because inet_diag_handler_cmd() calls __netlink_dump_start().

Regards,

Tao


