Bug#772733: linux-image-3.16-0.bpo.3-amd64: [blk-mq] blk queue freezing unreliable

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#772733: linux-image-3.16-0.bpo.3-amd64: [blk-mq] blk queue freezing unreliable
From: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Date: Wed, 10 Dec 2014 11:01:24 -0500
Message-id: <[🔎] 20141210160124.10921.60025.reportbug@sri-vm5>
Reply-to: Sebastian Parschauer <sebastian.riemer@profitbricks.com>, 772733@bugs.debian.org
Package: src:linux
Version: 3.16.5-1~bpo70+1
Severity: grave
Tags: upstream
Justification: renders package unusable

Dear Maintainer,

we are running Debian Wheezy Qemu/KVM VMs with the mentioned backported
kernel inside and test CPU, memory and virtio-blk block device hot-plug/
unplug.
With most other distributions/kernels there is no issue at all but with
the 3.16 kernel we encounter several issues with queue freezing in the
new blk-mq code which has been put in-place for the old block layer code.

We can reproduce two related issues.

##########################################################

Issue 1: Happened during virtio-blk device unplug:

BUG: unable to handle kernel paging request at 0000000000200028
IP: [<ffffffff81545ddf>] __mutex_lock_slowpath+0x3f/0x1c0
PGD 0
Oops: 0000 [#12] SMP
Modules linked in: loop joydev hid_generic usbhid hid crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel parport_pc pvpanic parport button processor thermal_sys aesni_intel psmouse serio_raw aes_x86_64 lrw pcspkr gf128mul glue_helper ablk_helper cryptd evdev i2c_piix4 i2c_core ext4 crc16 mbcache jbd2 ata_generic virtio_net virtio_blk floppy ata_piix libata scsi_mod uhci_hcd ehci_hcd crc32c_intel usbcore usb_common virtio_pci virtio_ring virtio
CPU: 1 PID: 20077 Comm: dd Tainted: G      D       3.16-0.bpo.3-amd64 #1 Debian 3.16.5-1~bpo70+1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
task: ffff8800338b6210 ti: ffff88003d3fc000 task.ti: ffff88003d3fc000
RIP: 0010:[<ffffffff81545ddf>]  [<ffffffff81545ddf>] __mutex_lock_slowpath+0x3f/0x1c0
RSP: 0018:ffff88003d3ffcd8  EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff88003d07b158 RCX: 01ffff8000000080
RDX: 0000000000200000 RSI: ffffffff81896b60 RDI: ffff88003d07b158
RBP: ffff88003d07b120 R08: 0000000fffffffe0 R09: ffff88003d3ffcb0
R10: ffff88003d530590 R11: 0000000000000000 R12: ffff8800338b6210
R13: ffff88003bed0c00 R14: ffff88003d457840 R15: ffffffffa005f160
FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000200028 CR3: 0000000001812000 CR4: 00000000000406e0
Stack:
 ffff88002f2fd5d8 0000000000000046 ffff88003d07b4b8 ffffffff81896b60
 ffff88003d07b158 ffff88003d07b120 ffff88003d07b158 ffff88003bed0c00
 ffff88003d457840 ffffffff81545f7b ffff88002f2fd008 ffffffff812a75c4
Call Trace:
 [<ffffffff81545f7b>] ? mutex_lock+0x1b/0x2a
 [<ffffffff812a75c4>] ? blk_mq_free_queue+0x24/0x150
 [<ffffffff8129e7c8>] ? blk_release_queue+0x88/0xd0
 [<ffffffff812ca160>] ? kobject_cleanup+0x80/0x1d0
 [<ffffffff812abba2>] ? disk_release+0x92/0xd0
 [<ffffffff813c4f3b>] ? device_release+0x3b/0xb0
 [<ffffffff812ca160>] ? kobject_cleanup+0x80/0x1d0
 [<ffffffff811f2095>] ? __blkdev_put+0x115/0x1a0
 [<ffffffff811f2285>] ? blkdev_close+0x25/0x30
 [<ffffffff811bd323>] ? __fput+0xb3/0x210
 [<ffffffff8108c164>] ? task_work_run+0xc4/0xe0
 [<ffffffff8106f310>] ? do_exit+0x2c0/0xa80
 [<ffffffff8106fb56>] ? do_group_exit+0x46/0xb0
 [<ffffffff8106fbd7>] ? SyS_exit_group+0x17/0x20
 [<ffffffff8154792d>] ? system_call_fast_compare_end+0x10/0x15
Code: 55 53 48 89 fb 48 83 ec 20 65 48 8b 04 25 48 c8 00 00 48 8b 80 38 c0 ff ff a8 08 75 29 48 8b 57 18 b8 01 00 00 00 48 85 d2 74 03 <8b> 42 28 85 c0 74 14 4c 8d 6b 20 4c 89 ef e8 0e eb b6 ff 84 c0
RIP  [<ffffffff81545ddf>] __mutex_lock_slowpath+0x3f/0x1c0
 RSP <ffff88003d3ffcd8>
CR2: 0000000000200028
---[ end trace 6dcfb6128736213b ]---
Fixing recursive fault but reboot is needed!

###############################################################

Issue 2: Happened during CPU hotplug:

CPU2 has been hot-added
smpboot: Booting Node 0 Processor 2 APIC 0x2
kvm-clock: cpu 2, msr 0:1ffef081, secondary cpu clock
TSC synchronization [CPU#0 -> CPU#2]:
Measured 732976580583 cycles TSC warp between CPUs, turning off TSC clock.
tsc: Marking TSC unstable due to check_tsc_sync_source failed
KVM setup async PF for cpu 2
kvm-stealtime: cpu 2, msr 1db0e080
microcode: CPU2 sig=0x206a1, pf=0x1, revision=0x1
platform microcode: firmware: failed to load intel-ucode/06-2a-01 (-2)
Will online and init hotplugged CPU: 2
kobject '(null)' (ffff88001db1a858): tried to add an uninitialized object, something is seriously wrong.
CPU: 0 PID: 429 Comm: udevd Not tainted 3.16-0.bpo.3-amd64 #1 Debian 3.16.5-1~bpo70+1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 0000000000000000 ffff88001ee18270 ffffffff8154144f ffff88001db1a858
 ffffffff812cae00 ffff880019e93d28 ffffffff00000020 ffff880019e93d68
 ffff880019e93d18 ffff880000000002 0000000300000003 0000000000000007
Call Trace:
[<ffffffff8154144f>] ? dump_stack+0x41/0x51
[<ffffffff812cae00>] ? kobject_add+0xc0/0xd0
[<ffffffff812a84dc>] ? blk_mq_register_hctx+0xbc/0xd0
[<ffffffff812a8eee>] ? blk_mq_sysfs_register+0x3e/0x50
[<ffffffff812a6802>] ? blk_mq_queue_reinit_notify+0x72/0xc0
[<ffffffff810939ad>] ? notifier_call_chain+0x4d/0x70
[<ffffffff8106cfa0>] ? __cpu_notify+0x20/0x40
[<ffffffff8106d28d>] ? _cpu_up+0x10d/0x170
[<ffffffff8106d369>] ? cpu_up+0x79/0x90
[<ffffffff8105ceda>] ? __do_page_fault+0x29a/0x530
[<ffffffff815380a2>] ? cpu_subsys_online+0x52/0xb0
[<ffffffff813c7605>] ? device_online+0x75/0xa0
[<ffffffff813c76b5>] ? online_store+0x85/0x90
[<ffffffff8122dbe2>] ? kernfs_fop_write+0x112/0x170
[<ffffffff811bbe25>] ? vfs_write+0xc5/0x1f0
[<ffffffff811bc330>] ? SyS_write+0x50/0xb0
[<ffffffff8154792d>] ? system_call_fast_compare_end+0x10/0x15
intel_rapl: RAPL domain package detection failed
intel_rapl: RAPL domain core detection failed
intel_rapl: RAPL domain uncore detection failed
intel_rapl: RAPL domain dram detection failed
intel_rapl: no valid rapl domains found in package 0

######################################

There are related upstream patches which didn't make it to linux-stable
and not to your kernel.

$ git log --reverse --oneline v3.16..v3.18-rc7 block/blk-mq.c
531ed62 blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
776687b block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
780db20 blk-mq: decouble blk-mq freezing from generic bypassing
72d6f02 blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
add703f blk-mq: use percpu_ref for mq usage count

Related discussion:
http://lkml.iu.edu/hypermail/linux/kernel/1406.2/02456.html

Please consider applying these patches and sending them to linux-stable!
We are testing the kernel with these patches applied at the moment.

#####################################

Analysis of issue 1:

Access to invalid memory address. A "dd" process inside the VM is doing I/O to
the block device to be unplugged. The block device gets unplugged while dd is
still accessing it. dd calls exit() resulting in closing all open file
desciptors. This causes a double free of parts of the block queue as queue
freezing didn't work correctly. The dd process came through although the queue
should have been frozen. The usage of the queue by dd remains undetected.

(gdb) list *blk_mq_free_queue+0x1f
0xffffffff812a75bf is in blk_mq_free_queue (/build/linux-LrLd2z/linux-3.16.5/block/blk-mq.c:1742).
1737    {
1738            struct blk_mq_tag_set *set = q->tag_set;
1739
1740            blk_mq_freeze_queue(q);
1741
1742            mutex_lock(&set->tag_list_lock);
1743            list_del_init(&q->tag_set_list);
1744            blk_mq_update_tag_set_depth(set);
1745            mutex_unlock(&set->tag_list_lock);
1746

(gdb) list *__mutex_lock_slowpath+0x3f
0xffffffff81545ddf is in __mutex_lock_slowpath (/build/linux-LrLd2z/linux-3.16.5/kernel/locking/mutex
..c:181).
176                     return 0;
177     
178             rcu_read_lock();
179             owner = ACCESS_ONCE(lock->owner);
180             if (owner)
181                     retval = owner->on_cpu;
182             rcu_read_unlock();
183             /*
184              * if lock->owner is not set, the mutex owner may have just acquired
185              * it and not set the owner yet or the mutex has been released.
(gdb)
(gdb)
(gdb) disassemble /m __mutex_lock_slowpath,+0x45
....
178             rcu_read_lock();
179             owner = ACCESS_ONCE(lock->owner);
   0xffffffff81545dd1 <__mutex_lock_slowpath+49>:       mov    0x18(%rdi),%rdx

180             if (owner)
   0xffffffff81545dda <__mutex_lock_slowpath+58>:       test   %rdx,%rdx
   0xffffffff81545ddd <__mutex_lock_slowpath+61>:       je     0xffffffff81545de2 <__mutex_lock_slowpath+66>

181                     retval = owner->on_cpu;
   0xffffffff81545ddf <__mutex_lock_slowpath+63>:       mov    0x28(%rdx),%eax

182             rcu_read_unlock();
....

lock->owner (0x200000 or rdx) became invalid. This has been detected upon its
first dereferencing with lock->owner->on_cpu (0x200028).

###########################################

Analysis of issue 2:

The "udevd" process tries to write to a sysfs file to activate the hot-plugged
CPU. blk_mq_queue_reinit_notify() is called. It involves freezing the queue
and is invoked on all existing queues. If the freezing doesn't work correctly,
then it is possible that queues are draining in parallel. Here this is detected
as a related kobject is detected to be uninitialized.

(gdb) list *blk_mq_queue_reinit_notify+0x6d
0xffffffff812a67fd is in blk_mq_queue_reinit_notify (/build/linux-LrLd2z/linux-3.16.5/block/blk-mq.c:1928).
1923		 * involves free and re-allocate memory, worthy doing?)
1924		 */
1925	
1926		blk_mq_map_swqueue(q);
1927	
1928		blk_mq_sysfs_register(q);
1929	
1930		blk_mq_unfreeze_queue(q);
1931	}
1932	

(gdb) list *kobject_add+0xbb
0xffffffff812cadfb is in kobject_add (/build/linux-LrLd2z/linux-3.16.5/lib/kobject.c:395).
390	
391		if (!kobj->state_initialized) {
392			printk(KERN_ERR "kobject '%s' (%p): tried to add an "
393			       "uninitialized object, something is seriously wrong.\n",
394			       kobject_name(kobj), kobj);
395			dump_stack();
396			return -EINVAL;
397		}
398		va_start(args, fmt);
399		retval = kobject_add_varg(kobj, parent, fmt, args);

-- Package-specific info:
** Version:
Linux version 3.16-0.bpo.3-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.16.5-1~bpo70+1 (2014-11-02)

-- debconf information:
  linux-image-3.16-0.bpo.3-amd64/postinst/mips-initrd-3.16-0.bpo.3-amd64:
  linux-image-3.16-0.bpo.3-amd64/postinst/depmod-error-initrd-3.16-0.bpo.3-amd64: false
  linux-image-3.16-0.bpo.3-amd64/prerm/removing-running-kernel-3.16-0.bpo.3-amd64: true
Reply to:
Prev by Date: Bug#772716: linux-image-3.16.0-4-amd64: no support for NVidia GK208 (NV108)
Next by Date: Processed: severity of 772733 is important
Previous by thread: Bug#772724: linux-image-3.16.0-4-686-pae: invalid opcode: 0000 [#1] / freeze: cgroup_rm_cftypes_locked+0x3d/0x50
Next by thread: Processed: severity of 772733 is important
Index(es):
- Date
- Thread