Bug#626593: linux-image-2.6.32-5-amd64: BUG during disk hot-plugging when setting the elevator via udev

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#626593: linux-image-2.6.32-5-amd64: BUG during disk hot-plugging when setting the elevator via udev
From: Apollon Oikonomopoulos <apoikos@gmail.com>
Date: Fri, 13 May 2011 16:34:48 +0300
Message-id: <[🔎] 20110513133448.GA20779@noc.grnet.gr>
Reply-to: Apollon Oikonomopoulos <apoikos@gmail.com>, 626593@bugs.debian.org

Package: linux-2.6
Version: 2.6.32-31
Severity: normal

Hi,

We are experiencing the following problem on a number of machines using
2.6.32-5-amd64_2.6.32-31. The machines are used for virtual machine hosting and
have a number of LUNs exported from an FC-connected SAN connected to them via a
multipath topology. Our regular workflow involves hot-removing and hot-adding
disks according to the VMs hosted.

For the LUNs exported by the SAN storage, we have the following udev rule in place:

-----8<-----
# Set all netapp LUN schedulers to noop

# Skip partitions
KERNEL=="*[0-9]", GOTO="lunsched_end"

# Set scheduler
ACTION=="add", SUBSYSTEM=="block", ATTRS{vendor}=="NETAPP", ATTRS{model}=="LUN",	ATTR{queue/scheduler}="noop"

LABEL="lunsched_end"
-----8<-----

Having upgraded from lenny to squeeze last week, we encountered the following
crash during a SCSI bus rescan that added new disks to a system:

[ 1258.343275] ------------[ cut here ]------------
[ 1258.343280] sd 0:0:0:226: [sdgv] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 1258.343287] kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_amd64_none/fs/sysfs/file.c:539!
[ 1258.343289] invalid opcode: 0000 [#2] SMP
[ 1258.343292] last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/0000:10:00.0/host0/rport-0:0-0/target0:0:0/0:0:0:216/block/sdgn/removable
[ 1258.343295] CPU 4
[ 1258.343296] Modules linked in: kvm_intel kvm nf_conntrack_ipv6 ip6table_filter ip6_tables xt_tcpudp xt_pkttype nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff ipmi_devintf radeon ttm drm_kms_helper snd_pcm ipmi_si drm ipmi_msghandler i2c_algo_bit i5k_amb i2c_core snd_timer psmouse i5000_edac snd soundcore snd_page_alloc hpwdt hpilo serio_raw edac_core pcspkr rng_core evdev shpchp container pci_hotplug button processor ext3 jbd mbcache dm_mod sd_mod crc_t10dif usbhid hid uhci_hcd qla2xxx scsi_transport_fc tg3 ehci_hcd bnx2 scsi_tgt usbcore nls_base cciss libphy scsi_mod thermal thermal_sys [last unloaded: scsi_wait_scan]
[ 1258.343332] Pid: 12287, comm: async/20 Tainted: G      D W  2.6.32-5-amd64 #1 ProLiant BL460c G1
[ 1258.343335] RIP: 0010:[<ffffffff8113ecff>]  [<ffffffff8113ecff>] sysfs_create_file+0x13/0x21
[ 1258.343340] RSP: 0018:ffff8803ffd3fdd8  EFLAGS: 00010246
[ 1258.343342] RAX: 0000000000000000 RBX: ffffffff81485598 RCX: 000000000000589a
[ 1258.343344] RDX: ffffffff81476c38 RSI: ffffffff81485598 RDI: 0000000000000000
[ 1258.343347] RBP: ffff88041afaba90 R08: 0000000000000000 R09: ffffffff813ad975
[ 1258.343349] R10: 00000000fffffff4 R11: 00000000000186a0 R12: 0000000000000000
[ 1258.343351] R13: 0000000000000000 R14: ffff8804264d4458 R15: 0000000000000000
[ 1258.343354] FS:  0000000000000000(0000) GS:ffff88000fd00000(0000) knlGS:0000000000000000
[ 1258.343356] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 1258.343359] CR2: 00007f541f420000 CR3: 00000003fddee000 CR4: 00000000000426e0
[ 1258.343361] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1258.343363] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1258.343366] Process async/20 (pid: 12287, threadinfo ffff8803ffd3e000, task ffff8804273f1530)
[ 1258.343368] Stack:
[ 1258.343369]  ffffffff811761aa 0000000000000000 ffff8804078c6418 ffff8804078c60f0
[ 1258.343372] <0> ffffffff8118024b ffff8804264d4400 ffff8804264d4400 ffff8804083ba800
[ 1258.343375] <0> ffff8804083ba928 0000000000000001 ffffffff81184021 ffff8804264d4400
[ 1258.343378] Call Trace:
[ 1258.343381]  [<ffffffff811761aa>] ? elv_register_queue+0x4f/0x6f
[ 1258.343385]  [<ffffffff8118024b>] ? blk_register_queue+0x7f/0xcc
[ 1258.343388]  [<ffffffff81184021>] ? add_disk+0xb8/0x108        
[ 1258.343393]  [<ffffffffa01506be>] ? sd_probe_async+0x119/0x1d8 [sd_mod]
[ 1258.343396]  [<ffffffff810698a7>] ? async_thread+0x0/0x20d
[ 1258.343399]  [<ffffffff810699a6>] ? async_thread+0xff/0x20d
[ 1258.343403]  [<ffffffff81049fee>] ? default_wake_function+0x0/0x9
[ 1258.343406]  [<ffffffff810698a7>] ? async_thread+0x0/0x20d
[ 1258.343408]  [<ffffffff81064721>] ? kthread+0x79/0x81
[ 1258.343411]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[ 1258.343414]  [<ffffffff810646a8>] ? kthread+0x0/0x81
[ 1258.343416]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20                                                             
[ 1258.343418] Code: 74 0f 48 89 ef e8 24 07 00 00 eb 05 bb fe ff ff ff 89 d8 5b 5d 41 5c c3 48 85 ff 74 0e 48 8b 7f 30 48 85 ff 74 05 48 85 f6 75 04 <0f> 0b eb fe ba 02 00 00 00 e9 5d ff ff ff 55 53 48 89 fb 48 c7                                      
[ 1258.343437] RIP  [<ffffffff8113ecff>] sysfs_create_file+0x13/0x21
[ 1258.343440]  RSP <ffff8803ffd3fdd8>
[ 1258.343443] ---[ end trace eeb541477f3e4233 ]---

Apart from this BUG, there are also sporadic warnings, like:
[  324.454044] kobject_add_internal failed for hÝ&$ÿÿÿÿiosched with -EEXIST, don't try to register things with the same name in the same directory.
(notice the garbage in the output)

The above is reproducible within a (small) number of SCSI bus rescans on our
systems. Disabling the relevant udev rule causes the issue to disappear.

I should note that we were using 2.6.32-bpo.5_2.6.32-23 with Lenny without any
issues, which led us to suspect udev for exposing this bug. Indeed, when
running with udev_0.125-7+lenny3 on squeeze with squeeze's kernel, the problem
is never reproducible.

Speculation
-----------
>From what I can gather from the backtrace, what _probably_ happens 
during disk
addition is the following:

1. add_disk() calls blk_register_queue()
2. blk_register_queue()(block/blk-sysfs.c:428) adds a kobject for the "queue"
   attribute and emits a uevent.
3. blk_register_queue() calls elv_register_queue()
4. elv_register_queue()(block/elevator.c:902) allocates a kobject for the
   queue's "iosched" attribute, using a reference of the queue's elevator.
5. elv_register_queue() populates the sysfs subtree of "iosched"

Now, the BUG and the WARNING are presumably triggered in the following way:

a) udev receives the uevent from step 2 and uses it to set the queue/scheduler
   attribute via sysfs. This in turn results in a call to
   elevator_switch()(block/elevator.c:983), which removes the old elevator from
   the queue and attaches the new elevator, in turn calling elv_register_queue().
b) if at this point steps 4-5 above have not completed, one of the following
   _may_ happen:
   - Both elv_register_queue()s (the one called by blk_register_queue() and the
     one called by elevator_switch()) run simultaneously with the same elevator
     reference (that of the "new" elevator set via sysfs). In this case the
     WARNING above about the duplicate iosched kobject appears and all is fine.
					OR
   - elv_register_queue() called by elevator_switch() replaces and frees the old
     elevator, which has a reference still in use by
     elv_register_queue()-called-by-blk_register_queue(). If the latter is still
     using the reference to populate the iosched subtree (see
     block/elevator.c:902+), then the BUG appears. 

In other words, we seem to be changing the elevator before the original one has
been completely set-up in sysfs, which is a race condition that can be
triggered in a busy SMP system with lots of disks added at the same time.

Upstream commit 430c62fb2948d964cf8dc7f3e2f69623c04ef62f seems to provide a
resolution to a similar-sounding issue, but 2.6.38 (which contains this commit)
also crashes in exactly the same way.

Thanks,
Apollon

-- Package-specific info:
** Version:
Linux version 2.6.32-5-amd64 (Debian 2.6.32-31) (ben@decadent.org.uk) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Mar 7 21:35:22 UTC 2011

Reply to:

Follow-Ups:
- Bug#626593: linux-image-2.6.32-5-amd64: BUG during disk hot-plugging when setting the elevator via udev
  - From: Ben Hutchings <ben@decadent.org.uk>

Prev by Date: Processed: tagging 624702
Next by Date: Bug#625914: [Bridge] Bug#625914: linux-image-2.6.38-2-amd64: bridging is not interacting well with multicast in 2.6.38-4
Previous by thread: Bug#481104: marked as done (/usr/sbin/update-initramfs: update-initramfs edits /etc/initramfs-tools/conf.d/cryptroot)
Next by thread: Bug#626593: linux-image-2.6.32-5-amd64: BUG during disk hot-plugging when setting the elevator via udev
Index(es):
- Date
- Thread