Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100

To: 637085@bugs.debian.org
Cc: Jonathan Nieder <jrnieder@gmail.com>
Subject: Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100
From: Paul Elliott <paul.elliott@ynic.york.ac.uk>
Date: Mon, 15 Aug 2011 11:29:18 +0100
Message-id: <4E48F4FE.4000407@ynic.york.ac.uk>
Reply-to: Paul Elliott <paul.elliott@ynic.york.ac.uk>, 637085@bugs.debian.org
In-reply-to: <20110808151643.GA20726@elie.gateway.2wire.net>
References: <20110808122424.2597.81303.reportbug@sulcus.ynic.york.ac.uk> <20110808151643.GA20726@elie.gateway.2wire.net>

Thanks for looking into this, Jonathan. We've spent the past week performing extensive tests, on both the software and the hardware side. Here are the steps we've taken and the results obtained.

1) We've re-run our test script[1] on an ext4 file system provided by local 10k SAS disks. We used the same 2.6.32 kernel as we used previously and did not have any issues. This backs up our theory that the issue lies with our QLogic FC HBAs. (We do trigger a hard lockup in cciss with an HP P800, but we will report this separately when the system in question becomes available for further testing.)

2) We checked all the cabling within our FC fabric and found one patch cable showing high numbers of errors; we have replaced this cable.

3) We've upgraded the firmware on all components within our fabric, from the server blades' BIOS all the way to the FC switches.

4) Repeated memtest86+ memory tests without issue.

5) Re-ran the original tests on a stock squeeze install. Although the tests ran for longer, they still ended in the same way:

Aug 11 15:16:33 occiput kernel: [ 7678.041405] ------------[ cut here ]------------
Aug 11 15:16:33 occiput kernel: [ 7678.041437] kernel BUG at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_none/kernel/workqueue.c:287!
Aug 11 15:16:33 occiput kernel: [ 7678.041496] invalid opcode: 0000 [#1] SMP
Aug 11 15:16:33 occiput kernel: [ 7678.041530] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-4/target1:0:1/1:0:1:8/block/sdh/stat
Aug 11 15:16:33 occiput kernel: [ 7678.041586] CPU 0

Aug 11 15:16:33 occiput kernel: [ 7678.041612] Modules linked in: ext4 jbd2 crc16 dm_round_robin sd_mod crc_t10dif ses enclosure ext2 dm_multipath scsi_dh loop snd_pcm snd_timer snd soundcore snd_page_alloc hpwdt hpilo joydev pcspkr evdev psmouse container serio_raw power_meter button processor ext3 jbd mbcache usbhid hid dm_mod uhci_hcd hpsa qla2xxx thermal scsi_transport_fc cciss scsi_tgt ehci_hcd thermal_sys usbcore nls_base scsi_mod be2net [last unloaded: scsi_wait_scan]
Aug 11 15:16:33 occiput kernel: [ 7678.041950] Pid: 2752, comm: ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7
Aug 11 15:16:33 occiput kernel: [ 7678.042000] RIP: 0010:[<ffffffff810618d6>] [<ffffffff810618d6>] worker_thread+0x177/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.042059] RSP: 0018:ffff8803ea8fbe40 EFLAGS: 00010282
Aug 11 15:16:33 occiput kernel: [ 7678.042088] RAX: 0000000000000000 RBX: ffff8803ea8fbef8 RCX: ffff880585687c38
Aug 11 15:16:33 occiput kernel: [ 7678.042121] RDX: ffff880585687c38 RSI: ffff8803ea8fbe80 RDI: ffffe8ffffa08680
Aug 11 15:16:33 occiput kernel: [ 7678.042154] RBP: ffffe8ffffa08680 R08: ffff8803ea8fa000 R09: ffff880015215780
Aug 11 15:16:33 occiput kernel: [ 7678.042187] R10: 00000001001c3081 R11: 0000000000000282 R12: ffff880585687c30
Aug 11 15:16:33 occiput kernel: [ 7678.042219] R13: ffff880585687c38 R14: ffff880585549530 R15: ffff880585549530
Aug 11 15:16:33 occiput kernel: [ 7678.042253] FS: 0000000000000000(0000) GS:ffff880015200000(0000) knlGS:0000000000000000
Aug 11 15:16:33 occiput kernel: [ 7678.042302] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug 11 15:16:33 occiput kernel: [ 7678.042332] CR2: 00007f9021e1f878 CR3: 0000000001001000 CR4: 00000000000006f0
Aug 11 15:16:33 occiput kernel: [ 7678.042365] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 15:16:33 occiput kernel: [ 7678.042398] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 15:16:33 occiput kernel: [ 7678.042431] Process ext4-dio-unwrit (pid: 2752, threadinfo ffff8803ea8fa000, task ffff880585549530)

Aug 11 15:16:33 occiput kernel: [ 7678.042481] Stack:
Aug 11 15:16:33 occiput kernel: [ 7678.042504] 000000000000f9e0 ffff8805855498e8 ffff880585549530 ffff8803ea8fbfd8
Aug 11 15:16:33 occiput kernel: [ 7678.042547] <0> ffff880585549530 ffffe8ffffa08698 ffffe8ffffa08688 ffffffffa0234d81
Aug 11 15:16:33 occiput kernel: [ 7678.042611] <0> 0000000000000000 ffff880585549530 ffffffff81064f1a ffff8803ea8fbe98

Aug 11 15:16:33 occiput kernel: [ 7678.042694] Call Trace:
Aug 11 15:16:33 occiput kernel: [ 7678.042727] [<ffffffffa0234d81>] ? ext4_end_aio_dio_work+0x0/0x5a [ext4]
Aug 11 15:16:33 occiput kernel: [ 7678.042761] [<ffffffff81064f1a>] ? autoremove_wake_function+0x0/0x2e
Aug 11 15:16:33 occiput kernel: [ 7678.042794] [<ffffffff8106175f>] ? worker_thread+0x0/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.042825] [<ffffffff81064c4d>] ? kthread+0x79/0x81
Aug 11 15:16:33 occiput kernel: [ 7678.042856] [<ffffffff81011baa>] ? child_rip+0xa/0x20
Aug 11 15:16:33 occiput kernel: [ 7678.042886] [<ffffffff81064bd4>] ? kthread+0x0/0x81
Aug 11 15:16:33 occiput kernel: [ 7678.042915] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
Aug 11 15:16:33 occiput kernel: [ 7678.042943] Code: 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 00 48 89 40 08 66 ff 45 00 fb 66 0f 1f 44 00 00 49 8b 45 f8 48 83 e0 fc 48 39 c5 74 04 <0f> 0b eb fe f0 41 80 65 f8 fe 4c 89 e7 ff 54 24 38 48 8b 44 24
Aug 11 15:16:33 occiput kernel: [ 7678.043239] RIP [<ffffffff810618d6>] worker_thread+0x177/0x21d
Aug 11 15:16:33 occiput kernel: [ 7678.043274]  RSP <ffff8803ea8fbe40>
Aug 11 15:16:33 occiput kernel: [ 7678.043706] ---[ end trace e0f3d4c037247dda ]---

6) Ran the test on 2.6.32-71.el6.x86_64 from CentOS 6. This kernel runs fine and does not emit the underrun errors. For info, CentOS is using firmware 5.03.02 and driver version 8.03.01.05.06.0-k8. Squeeze is using firmware 5.03.02 and driver version 8.03.01-k6.

7) Ran the test on linux-image-3.0.0-1-amd64 (sid), 2.6.39-bpo.2-amd64 and linux-image-2.6.38-bpo.2-amd64. All of these run fine. They do not emit the underrun errors reported in the original bug report, which could be related.


Where should we go from here?

[1] Our test script mounts 3 x 1TB ext4 volumes and then continuously loops through a bonnie++ test and four fio runs performing sequential read, random read/write, sequential write and random write tests.
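
The loop described in [1] might be sketched roughly as follows. This is illustrative only and is not the actual test script: the mount points, the file and block sizes, the bonnie++ user, and the choice to work through the volumes one after another are all assumptions. The sketch only builds and prints the command lines rather than running them.

```python
import itertools
import shlex

# Hypothetical mount points for the three 1 TB ext4 volumes (assumed names).
MOUNT_POINTS = ["/mnt/test1", "/mnt/test2", "/mnt/test3"]

# The four access patterns named in [1], mapped to fio's --rw values.
FIO_PATTERNS = {
    "sequential-read": "read",
    "random-read-write": "randrw",
    "sequential-write": "write",
    "random-write": "randwrite",
}


def build_commands(directory, size="4g", block_size="4k"):
    """Return one bonnie++ command and four fio commands for a directory.

    The size and block_size defaults are placeholders, not measured values.
    """
    commands = [["bonnie++", "-d", directory, "-u", "root"]]
    for name, rw in FIO_PATTERNS.items():
        commands.append([
            "fio",
            f"--name={name}",
            f"--directory={directory}",
            f"--rw={rw}",
            f"--bs={block_size}",
            f"--size={size}",
        ])
    return commands


def command_stream(mount_points, passes=None):
    """Yield commands for every mount point; repeat forever if passes is None."""
    counter = itertools.count() if passes is None else range(passes)
    for _ in counter:
        for directory in mount_points:
            yield from build_commands(directory)


if __name__ == "__main__":
    # Print a single pass. A real run would hand each command to
    # subprocess.run() and keep looping until the machine hangs.
    for cmd in command_stream(MOUNT_POINTS, passes=1):
        print(shlex.join(cmd))
```

Calling command_stream(MOUNT_POINTS) without passes gives the continuous loop; passes=1 is only there so the sketch terminates.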


--
Paul Elliott, UNIX Systems Administrator
York Neuroimaging Centre, University of York
