
Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100



Thanks for looking into this, Jonathan. We've spent the past week performing extensive tests on both the software and hardware sides. Here are the steps we've taken and the results obtained.

1) We re-ran our test script[1] on an ext4 file system backed by local 10k SAS disks, using the same 2.6.32 kernel as before, and saw no issues. This supports our theory that the problem lies with our QLogic FC HBAs. (We can also trigger a hard lockup in cciss with an HP P800; we will report that separately once the system in question becomes available for further testing.)

2) We checked all the cabling within our FC fabric and found one patch cable showing a high error count; we have replaced it.

3) We upgraded the firmware on all components within our fabric, from the server blades' BIOS all the way to the FC switches.

4) We repeated the memtest86+ memory tests without issue.

5) We re-ran the original tests on a stock squeeze install; although they ran for longer this time, they still ended in the same way:

Aug 11 15:16:33 occiput kernel: [ 7678.041405] ------------[ cut here ]------------
Aug 11 15:16:33 occiput kernel: [ 7678.041437] kernel BUG at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_none/kernel/workqueue.c:287!
Aug 11 15:16:33 occiput kernel: [ 7678.041496] invalid opcode: 0000 [#1] SMP
Aug 11 15:16:33 occiput kernel: [ 7678.041530] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-4/target1:0:1/1:0:1:8/block/sdh/stat
Aug 11 15:16:33 occiput kernel: [ 7678.041586] CPU 0
Aug 11 15:16:33 occiput kernel: [ 7678.041612] Modules linked in: ext4 jbd2 crc16 dm_round_robin sd_mod crc_t10dif ses enclosure ext2 dm_multipath scsi_dh loop snd_pcm snd_timer snd soundcore snd_page_alloc hpwdt hpilo joydev pcspkr evdev psmouse container serio_raw power_meter button processor ext3 jbd mbcache usbhid hid dm_mod uhci_hcd hpsa qla2xxx thermal scsi_transport_fc cciss scsi_tgt ehci_hcd thermal_sys usbcore nls_base scsi_mod be2net [last unloaded: scsi_wait_scan]
Aug 11 15:16:33 occiput kernel: [ 7678.041950] Pid: 2752, comm: ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7
Aug 11 15:16:33 occiput kernel: [ 7678.042000] RIP: 0010:[<ffffffff810618d6>] [<ffffffff810618d6>] worker_thread+0x177/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.042059] RSP: 0018:ffff8803ea8fbe40 EFLAGS: 00010282
Aug 11 15:16:33 occiput kernel: [ 7678.042088] RAX: 0000000000000000 RBX: ffff8803ea8fbef8 RCX: ffff880585687c38
Aug 11 15:16:33 occiput kernel: [ 7678.042121] RDX: ffff880585687c38 RSI: ffff8803ea8fbe80 RDI: ffffe8ffffa08680
Aug 11 15:16:33 occiput kernel: [ 7678.042154] RBP: ffffe8ffffa08680 R08: ffff8803ea8fa000 R09: ffff880015215780
Aug 11 15:16:33 occiput kernel: [ 7678.042187] R10: 00000001001c3081 R11: 0000000000000282 R12: ffff880585687c30
Aug 11 15:16:33 occiput kernel: [ 7678.042219] R13: ffff880585687c38 R14: ffff880585549530 R15: ffff880585549530
Aug 11 15:16:33 occiput kernel: [ 7678.042253] FS: 0000000000000000(0000) GS:ffff880015200000(0000) knlGS:0000000000000000
Aug 11 15:16:33 occiput kernel: [ 7678.042302] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug 11 15:16:33 occiput kernel: [ 7678.042332] CR2: 00007f9021e1f878 CR3: 0000000001001000 CR4: 00000000000006f0
Aug 11 15:16:33 occiput kernel: [ 7678.042365] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 15:16:33 occiput kernel: [ 7678.042398] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 15:16:33 occiput kernel: [ 7678.042431] Process ext4-dio-unwrit (pid: 2752, threadinfo ffff8803ea8fa000, task ffff880585549530)
Aug 11 15:16:33 occiput kernel: [ 7678.042481] Stack:
Aug 11 15:16:33 occiput kernel: [ 7678.042504] 000000000000f9e0 ffff8805855498e8 ffff880585549530 ffff8803ea8fbfd8
Aug 11 15:16:33 occiput kernel: [ 7678.042547] <0> ffff880585549530 ffffe8ffffa08698 ffffe8ffffa08688 ffffffffa0234d81
Aug 11 15:16:33 occiput kernel: [ 7678.042611] <0> 0000000000000000 ffff880585549530 ffffffff81064f1a ffff8803ea8fbe98
Aug 11 15:16:33 occiput kernel: [ 7678.042694] Call Trace:
Aug 11 15:16:33 occiput kernel: [ 7678.042727] [<ffffffffa0234d81>] ? ext4_end_aio_dio_work+0x0/0x5a [ext4]
Aug 11 15:16:33 occiput kernel: [ 7678.042761] [<ffffffff81064f1a>] ? autoremove_wake_function+0x0/0x2e
Aug 11 15:16:33 occiput kernel: [ 7678.042794] [<ffffffff8106175f>] ? worker_thread+0x0/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.042825] [<ffffffff81064c4d>] ? kthread+0x79/0x81
Aug 11 15:16:33 occiput kernel: [ 7678.042856] [<ffffffff81011baa>] ? child_rip+0xa/0x20
Aug 11 15:16:33 occiput kernel: [ 7678.042886] [<ffffffff81064bd4>] ? kthread+0x0/0x81
Aug 11 15:16:33 occiput kernel: [ 7678.042915] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
Aug 11 15:16:33 occiput kernel: [ 7678.042943] Code: 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 00 48 89 40 08 66 ff 45 00 fb 66 0f 1f 44 00 00 49 8b 45 f8 48 83 e0 fc 48 39 c5 74 04 <0f> 0b eb fe f0 41 80 65 f8 fe 4c 89 e7 ff 54 24 38 48 8b 44 24
Aug 11 15:16:33 occiput kernel: [ 7678.043239] RIP [<ffffffff810618d6>] worker_thread+0x177/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.043274]  RSP <ffff8803ea8fbe40>
Aug 11 15:16:33 occiput kernel: [ 7678.043706] ---[ end trace e0f3d4c037247dda ]---

6) Ran the test on 2.6.32-71.el6.x86_64 from CentOS 6. This kernel runs fine and does not emit the underrun errors. For reference, CentOS is using firmware 5.03.02 and driver version 8.03.01.05.06.0-k8, while squeeze is using firmware 5.03.02 and driver version 8.03.01-k6.

7) Ran the test on linux-image-3.0.0-1-amd64 (sid), 2.6.39-bpo.2-amd64 and linux-image-2.6.38-bpo.2-amd64. All of these run fine and none of them emit the underrun errors reported in the original bug report, which could be related.
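
For reference, driver and firmware versions like the ones above can be read with something along these lines (a rough sketch only; the fw_version attribute is qla2xxx-specific and exact sysfs paths can differ between driver releases):

modinfo qla2xxx | grep -i '^version'        # version of the installed qla2xxx module
cat /sys/module/qla2xxx/version             # version of the currently loaded module
cat /sys/class/scsi_host/host*/fw_version   # per-port HBA firmware (present on qla2xxx hosts only)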

Where should we go from here?

[1] Our test script mounts 3 x 1TB ext4 volumes and then continuously loops through a bonnie++ test and four fio runs performing sequential reads, random read/write, sequential write and random write tests.
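
As a rough illustration of [1], the loop looks something like the sketch below (mount points, sizes and fio options are placeholders rather than the exact values from our script):

#!/bin/sh
# Outline only: the three ext4 test volumes are assumed to be mounted already,
# and the bonnie++/fio parameters shown here are illustrative.
while true; do
    for mnt in /mnt/test1 /mnt/test2 /mnt/test3; do
        bonnie++ -d "$mnt" -u nobody
        fio --name=seqread   --directory="$mnt" --rw=read      --bs=1M --size=4G --direct=1
        fio --name=randrw    --directory="$mnt" --rw=randrw    --bs=4k --size=4G --direct=1
        fio --name=seqwrite  --directory="$mnt" --rw=write     --bs=1M --size=4G --direct=1
        fio --name=randwrite --directory="$mnt" --rw=randwrite --bs=4k --size=4G --direct=1
    done
done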

--
Paul Elliott, UNIX Systems Administrator
York Neuroimaging Centre, University of York



