
Re: [Nbd] 3.12 BUG() on ext4, kernel crash on nbd-client when nbd server rebooting

On 14-11-13 08:58, Jan Kara wrote:
> On Wed 13-11-13 05:59:11, Denys Fedoryshchenko wrote:
>> Hi
>> On 2013-11-12 23:46, Jan Kara wrote:
>>> Hello,
>>> On Tue 12-11-13 16:34:07, Denys Fedoryshchenko wrote:
>>>> I just did some fault testing on a test nbd setup and found that if
>>>> I reboot the nbd server, I immediately get a BUG() message on the nbd
>>>> client and a filesystem that I cannot unmount; any operation on
>>>> it freezes and locks up the processes trying to access it.
>>>  So how exactly did you do the fault testing? Because it seems something
>>> has discarded the block device under filesystem's toes and the superblock
>>> buffer_head got unmapped. Didn't something call NBD_CLEAR_SOCK ioctl?
>>> Because that calls kill_bdev() which would do exactly that...
>> Client side:
>> modprobe nbd
>> nbd-client /dev/nbd0 -name export1
>> nbd-client /dev/nbd1 -name export2
>> nbd-client /dev/nbd2 -name export3
>> mount /dev/nbd0 /mnt/disk1
>> mount /dev/nbd1 /mnt/disk2
>> mount /dev/nbd2 /mnt/disk3
>> On the server I have this config:
>> [generic]
>> [export1]
>>         exportname = /dev/sda1
>> [export2]
>>         exportname = /dev/sdb1
>> [export3]
>>         exportname = /dev/sdc1
>> Steps to reproduce:
>> 1) Start a large file copy on the client side to /mnt/disk1/
>> 2) Reboot the server. It reboots quite fast, in just a few seconds, and
>> the server system gets its IP before the nbd-server process starts
>> listening, so nbd-client will probably see "connection refused".
>> 3) It seems that when the client gets "connection refused", it goes mad.
>> I can try to capture a traffic dump or do any other debug operation;
>> please let me know what I should run :)
>> P.S. I noticed that maybe I should run persist mode, but it should
>> not crash like this anyway, I think.
>   OK, no need for further debugging. I see what's going on. In NBD_DO_IT
> ioctl() nbd calls kill_bdev() after the kthread returned - and that happens
> in your case as we can see from "queue cleared" messages.
> Now there is a question how to fix this. Filesystems don't really expect
> device buffers to disappear under them as they do when nbd calls kill_bdev().
> Also that never happens with normal block devices - if a similar situation
> happens to SCSI / SATA disk, corresponding block devices hang around
> refusing any IO until the filesystem is unmounted and at that point they
> disappear (device's refcount - bd_openers - reaches zero). It would be good
> if NBD behaved the same way - maybe we should return from NBD_DO_IT ioctl
> only after bd_openers drops to 1 (not zero because the nbd client has the
> device open as well for the ioctl if I'm right)?
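
The behaviour proposed above can be illustrated with a toy model. This is
only a sketch in Python, not kernel code: the class ToyNbdDevice and its
methods are hypothetical names, "openers" merely stands in for bd_openers,
and setting buffers_valid to False stands in for what kill_bdev() does. It
assumes the nbd client itself accounts for one opener.

```python
import errno


class ToyNbdDevice:
    """Toy stand-in for an nbd block device (illustration only)."""

    def __init__(self):
        self.openers = 1           # the nbd client holds the device open
        self.connected = True
        self.buffers_valid = True  # False models the effect of kill_bdev()

    def open(self):
        # e.g. a filesystem being mounted on the device
        self.openers += 1

    def close(self):
        # e.g. the filesystem being unmounted
        self.openers -= 1
        self._maybe_teardown()

    def lose_connection(self):
        self.connected = False
        self._maybe_teardown()

    def submit_io(self):
        # A dead device refuses IO instead of dropping its buffers.
        if not self.connected:
            raise OSError(errno.EIO, "connection to server lost")
        return "ok"

    def _maybe_teardown(self):
        # Tear down only once the client is the sole remaining opener.
        if not self.connected and self.openers <= 1:
            self.buffers_valid = False
```

In this model a mounted filesystem keeps its buffers after the connection
is lost and simply sees IO errors; teardown happens only at unmount.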

I'm not sure if this has been implemented that way (that's Paul's area,
not mine), but the intention was that the nbd kernel module would only
do cleanup once the nbd-client process exits. That is, if nbd-client has
not yet exited, that could be because it's in -persist mode and is
trying to reconnect.

Once it does exit, that means we've definitely lost the connection to
the server, and it's not coming back (at least not without user
intervention). Keeping the connection open at that point is a bad idea;
and keeping the device "alive" but blocked probably is, too, since it
would only result in processes trying to use a blocked device. I've seen
such a situation resulting in a system eventually ending up completely
deadlocked in userspace. The same would happen on in-system block
devices; but the difference there is that a loss of connection to an
in-system block device is massively less likely than a loss of
connection to a block device somewhere on the network, and if a SCSI
disc doesn't reply anymore, that probably means you've got hardware
failure, not "just" network issues. While I agree this is pretty
problematic if there's a filesystem running on the device, it's not so
much of a problem for other uses of nbd.

However, I should note that earlier versions of nbd-client would give up
connecting as soon as they got a "connection refused" from the server;
this was fixed with commit 6abdd46853, which is in nbd 3.2 and later.
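
To illustrate the difference, here is a minimal sketch of retrying on
"connection refused" rather than giving up at the first one. It is not the
nbd-client code (which is written in C); connect_with_retry and its
parameters are hypothetical names chosen for this example, and the retry
count and delay are arbitrary.

```python
import time


def connect_with_retry(connect, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    Only ConnectionRefusedError is retried; any other error propagates
    immediately. The last refusal is re-raised once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == max_attempts:
                raise
            sleep(delay)  # give the server time to start listening
```

With a real socket, connect could be something like
lambda: socket.create_connection((host, port)), where host and port are
whatever the server is configured to use.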

This end should point toward the ground if you want to go to space.

If it starts pointing toward space you are having a bad problem and you
will not go to space today.

  -- http://xkcd.com/1133/
