Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100
Thanks for looking into this, Jonathan. We've spent the past week
performing extensive tests on both the software and hardware sides.
Here are the steps we've taken and the results obtained.
1) We've re-run our test script[1] on an ext4 file system provided by
local 10k SAS disks. We used the same 2.6.32 kernel as before and did
not have any issues. This backs up our theory that the issue lies with
our QLogic FC HBAs. (We do trigger a hard lockup in cciss with an HP
P800, but we will report this separately when the system in question
becomes available for further testing.)
2) We checked all the cabling within our FC fabric and found one patch
cable showing high numbers of errors. We have replaced this cable.
3) We've upgraded the firmware on all components within our fabric, from
the server blades' BIOS all the way to the FC switches.
4) Repeated memtest86+ memory tests without issue.
5) Re-ran the original tests on a stock squeeze install. Although the
tests ran for longer, they still ended in the same way:
Aug 11 15:16:33 occiput kernel: [ 7678.041405] ------------[ cut here
]------------
Aug 11 15:16:33 occiput kernel: [ 7678.041437] kernel BUG at
/build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_none/kernel/workqueue.c:287!
Aug 11 15:16:33 occiput kernel: [ 7678.041496] invalid opcode: 0000 [#1]
SMP
Aug 11 15:16:33 occiput kernel: [ 7678.041530] last sysfs file:
/sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-4/target1:0:1/1:0:1:8/block/sdh/stat
Aug 11 15:16:33 occiput kernel: [ 7678.041586] CPU 0
Aug 11 15:16:33 occiput kernel: [ 7678.041612] Modules linked in: ext4
jbd2 crc16 dm_round_robin sd_mod crc_t10dif ses enclosure ext2
dm_multipath scsi_dh loop snd_pcm snd_timer snd soundcore snd_page_alloc
hpwdt hpilo joydev pcspkr evdev psmouse container serio_raw power_meter
button processor ext3 jbd mbcache usbhid hid dm_mod uhci_hcd hpsa
qla2xxx thermal scsi_transport_fc cciss scsi_tgt ehci_hcd thermal_sys
usbcore nls_base scsi_mod be2net [last unloaded: scsi_wait_scan]
Aug 11 15:16:33 occiput kernel: [ 7678.041950] Pid: 2752, comm:
ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7
Aug 11 15:16:33 occiput kernel: [ 7678.042000] RIP:
0010:[<ffffffff810618d6>] [<ffffffff810618d6>] worker_thread+0x177/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.042059] RSP:
0018:ffff8803ea8fbe40 EFLAGS: 00010282
Aug 11 15:16:33 occiput kernel: [ 7678.042088] RAX: 0000000000000000
RBX: ffff8803ea8fbef8 RCX: ffff880585687c38
Aug 11 15:16:33 occiput kernel: [ 7678.042121] RDX: ffff880585687c38
RSI: ffff8803ea8fbe80 RDI: ffffe8ffffa08680
Aug 11 15:16:33 occiput kernel: [ 7678.042154] RBP: ffffe8ffffa08680
R08: ffff8803ea8fa000 R09: ffff880015215780
Aug 11 15:16:33 occiput kernel: [ 7678.042187] R10: 00000001001c3081
R11: 0000000000000282 R12: ffff880585687c30
Aug 11 15:16:33 occiput kernel: [ 7678.042219] R13: ffff880585687c38
R14: ffff880585549530 R15: ffff880585549530
Aug 11 15:16:33 occiput kernel: [ 7678.042253] FS:
0000000000000000(0000) GS:ffff880015200000(0000) knlGS:0000000000000000
Aug 11 15:16:33 occiput kernel: [ 7678.042302] CS: 0010 DS: 0018 ES:
0018 CR0: 000000008005003b
Aug 11 15:16:33 occiput kernel: [ 7678.042332] CR2: 00007f9021e1f878
CR3: 0000000001001000 CR4: 00000000000006f0
Aug 11 15:16:33 occiput kernel: [ 7678.042365] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 15:16:33 occiput kernel: [ 7678.042398] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 15:16:33 occiput kernel: [ 7678.042431] Process ext4-dio-unwrit
(pid: 2752, threadinfo ffff8803ea8fa000, task ffff880585549530)
Aug 11 15:16:33 occiput kernel: [ 7678.042481] Stack:
Aug 11 15:16:33 occiput kernel: [ 7678.042504] 000000000000f9e0
ffff8805855498e8 ffff880585549530 ffff8803ea8fbfd8
Aug 11 15:16:33 occiput kernel: [ 7678.042547] <0> ffff880585549530
ffffe8ffffa08698 ffffe8ffffa08688 ffffffffa0234d81
Aug 11 15:16:33 occiput kernel: [ 7678.042611] <0> 0000000000000000
ffff880585549530 ffffffff81064f1a ffff8803ea8fbe98
Aug 11 15:16:33 occiput kernel: [ 7678.042694] Call Trace:
Aug 11 15:16:33 occiput kernel: [ 7678.042727] [<ffffffffa0234d81>] ?
ext4_end_aio_dio_work+0x0/0x5a [ext4]
Aug 11 15:16:33 occiput kernel: [ 7678.042761] [<ffffffff81064f1a>] ?
autoremove_wake_function+0x0/0x2e
Aug 11 15:16:33 occiput kernel: [ 7678.042794] [<ffffffff8106175f>] ?
worker_thread+0x0/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.042825] [<ffffffff81064c4d>] ?
kthread+0x79/0x81
Aug 11 15:16:33 occiput kernel: [ 7678.042856] [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Aug 11 15:16:33 occiput kernel: [ 7678.042886] [<ffffffff81064bd4>] ?
kthread+0x0/0x81
Aug 11 15:16:33 occiput kernel: [ 7678.042915] [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Aug 11 15:16:33 occiput kernel: [ 7678.042943] Code: 08 48 8b 50 08 48
89 51 08 48 89 0a 48 89 00 48 89 40 08 66 ff 45 00 fb 66 0f 1f 44 00 00
49 8b 45 f8 48 83 e0 fc 48 39 c5 74 04 <0f> 0b eb fe f0 41 80 65 f8 fe
4c 89 e7 ff 54 24 38 48 8b 44 24
Aug 11 15:16:33 occiput kernel: [ 7678.043239] RIP [<ffffffff810618d6>]
worker_thread+0x177/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.043274] RSP <ffff8803ea8fbe40>
Aug 11 15:16:33 occiput kernel: [ 7678.043706] ---[ end trace
e0f3d4c037247dda ]---
6) Ran the test on 2.6.32-71.el6.x86_64 from CentOS 6. This kernel runs
fine and does not emit the underrun errors. For reference, CentOS is
using firmware 5.03.02 and driver version 8.03.01.05.06.0-k8; squeeze is
using firmware 5.03.02 and driver version 8.03.01-k6.
7) Ran the test on linux-image-3.0.0-1-amd64 (sid), 2.6.39-bpo.2-amd64
and linux-image-2.6.38-bpo.2-amd64. All of these run fine and do not
emit the underrun errors described in the original bug report, so the
underruns could be related to the hang.
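
For anyone comparing traces between runs, the following is a minimal
sketch (not something we used in the tests above) of pulling the BUG
location and faulting function out of a mail-wrapped syslog capture like
the one in step 5. The function name summarize_oops and the returned
keys are hypothetical, and the patterns only cover the two line formats
that appear in this report.

```python
import re

# Matches e.g. "kernel BUG at /path/to/workqueue.c:287!"
BUG_RE = re.compile(r"kernel BUG at\s+(\S+):(\d+)!")
# Matches the final "RIP [<addr>] function+0xoff/0xlen" line, not the
# earlier "RIP: 0010:[<addr>]" register dump line.
RIP_RE = re.compile(
    r"RIP\s+\[<([0-9a-f]+)>\]\s+(\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)"
)


def summarize_oops(text):
    """Return the BUG source location and faulting function, if present.

    The log may have been hard-wrapped by a mail client, so all runs of
    whitespace (including newlines) are collapsed before matching.
    """
    flat = " ".join(text.split())
    summary = {}
    bug = BUG_RE.search(flat)
    if bug:
        # Keep only the file name; the build path differs per kernel build.
        summary["file"] = bug.group(1).rsplit("/", 1)[-1]
        summary["line"] = int(bug.group(2))
    rip = RIP_RE.search(flat)
    if rip:
        summary["function"] = rip.group(2)
        summary["offset"] = rip.group(3)
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_oops.py < captured-syslog.txt
    print(summarize_oops(sys.stdin.read()))
```

Reducing each trace to file, line and function makes it easier to see
whether two crashes hit the same BUG check, without being distracted by
timestamps and build paths.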
Where should we go from here?
[1] Our test script mounts 3 x 1TB ext4 volumes and then continuously
loops through a bonnie++ test and four fio runs performing sequential
reads, random read/write, sequential write and random write tests.
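
For reference, a rough sketch of what such a loop looks like is below.
This is not our actual script: the mount points, the fio --size value
and the bonnie++ user are placeholder assumptions, the volumes are
assumed to be mounted already, and running the volumes one after another
rather than in parallel is also an assumption.

```python
import subprocess

# Placeholder mount points for the three ext4 volumes (assumption).
MOUNT_POINTS = ["/mnt/test1", "/mnt/test2", "/mnt/test3"]

# fio --rw values for: sequential read, random read/write,
# sequential write, random write.
FIO_PATTERNS = ["read", "randrw", "write", "randwrite"]


def build_commands(mount_point, size="4g", user="root"):
    """Return one bonnie++ command followed by four fio commands."""
    cmds = [["bonnie++", "-d", mount_point, "-u", user]]
    for pattern in FIO_PATTERNS:
        cmds.append([
            "fio",
            "--name=" + pattern,
            "--directory=" + mount_point,
            "--rw=" + pattern,
            "--size=" + size,  # placeholder size, not from our script
        ])
    return cmds


def run_forever(mount_points):
    """Loop over every volume until a command fails or the box hangs."""
    while True:
        for mount_point in mount_points:
            for cmd in build_commands(mount_point):
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_forever(MOUNT_POINTS)
```

Building the command lists separately from running them makes it easy to
print and check the exact invocations before starting a long run.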
--
Paul Elliott, UNIX Systems Administrator
York Neuroimaging Centre, University of York