Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100

To: Jonathan Nieder <jrnieder@gmail.com>
Cc: 637085@bugs.debian.org
Subject: Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100
From: Paul Elliott <paul.elliott@ynic.york.ac.uk>
Date: Mon, 08 Aug 2011 18:10:41 +0100
Message-id: <4E401891.3010002@ynic.york.ac.uk>
Reply-to: Paul Elliott <paul.elliott@ynic.york.ac.uk>, 637085@bugs.debian.org
In-reply-to: <20110808151643.GA20726@elie.gateway.2wire.net>
References: <20110808122424.2597.81303.reportbug@sulcus.ynic.york.ac.uk> <20110808151643.GA20726@elie.gateway.2wire.net>

Hi Jonathan,

On 08/08/11 16:16, Jonathan Nieder wrote:

> I assume this is fairly reproducible even after a reboot?  Is the

Correct, we can reproduce the lock ups after a reboot following 5-60 minutes of high I/O load (900MB/s plus).

> stacktrace from the first sign of trouble in dmesg always the same?

I'm no expert at reading these but I believe it is the same. Here's the trace after the next reboot/lock up cycle:

[ 3705.959849] kernel BUG at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_none/mm/slub.c:2969!

[ 3706.077621] invalid opcode: 0000 [#1] SMP

[ 3706.113947] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-3/target1:0:1/1:0:1:0/block/sdj/stat

[ 3706.235513] CPU 0

[ 3706.251928] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 dm_round_robin dm_multipath scsi_dh loop sd_mod crc_t10dif snd_pcm joydev snd_timer snd soundcore snd_page_alloc usbhid hid evdev pcspkr hpilo hpwdt psmouse power_meter container processor button serio_raw ext3 jbd mbcache dm_mod hpsa cciss uhci_hcd ehci_hcd qla2xxx usbcore scsi_transport_fc nls_base scsi_tgt scsi_mod be2net thermal thermal_sys [last unloaded: scsi_wait_scan]
[ 3706.781628] Pid: 1845, comm: ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7
[ 3706.882853] RIP: 0010:[<ffffffff810e730b>] [<ffffffff810e730b>] kfree+0x55/0xcb

[ 3706.956205] RSP: 0018:ffff8805851c7e00  EFLAGS: 00010246

[ 3707.017700] RAX: 0200000000000000 RBX: ffff88058553eed0 RCX: 0000000000000042
[ 3707.091197] RDX: ffff88058553eea0 RSI: 0000000000000041 RDI: ffffea001352a590
[ 3707.167835] RBP: ffff88058553eea0 R08: ffff880585fdc0d0 R09: 0000000000080000
[ 3707.245578] R10: 0000000000000014 R11: ffff880584a6b8b8 R12: ffffffffa023ddcf
[ 3707.319659] R13: ffff88058553eed8 R14: ffff880584a6b880 R15: ffff880584a6b880
[ 3707.393985] FS: 0000000000000000(0000) GS:ffff880015200000(0000) knlGS:0000000000000000

[ 3707.476061] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b

[ 3707.538541] CR2: 00007f4ffff0377c CR3: 000000026295b000 CR4: 00000000000006f0
[ 3707.627218] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3707.707945] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3707.788003] Process ext4-dio-unwrit (pid: 1845, threadinfo ffff8805851c6000, task ffff880584a6b880)

[ 3707.885120] Stack:

[ 3707.914054]  ffff88058553eed0 ffff88058553eea0 ffff8805844b0928 ffffffffa023ddcf
[ 3707.992872] <0> ffff8805851c7ef8 ffffe8ffffa08680 ffff88058553eed0 ffffffff810618e7
[ 3708.072050] <0> 000000000000f9e0 ffff880584a6bc38 ffff880584a6b880 ffff8805851c7fd8

[ 3708.169803] Call Trace:

[ 3708.211806]  [<ffffffffa023ddcf>] ? ext4_end_aio_dio_work+0x4e/0x5a [ext4]

[ 3708.285689]  [<ffffffff810618e7>] ? worker_thread+0x188/0x21d
[ 3708.340716]  [<ffffffffa023dd81>] ? ext4_end_aio_dio_work+0x0/0x5a [ext4]
[ 3708.415673]  [<ffffffff81064f1a>] ? autoremove_wake_function+0x0/0x2e
[ 3708.495456]  [<ffffffff8106175f>] ? worker_thread+0x0/0x21d
[ 3708.554553]  [<ffffffff81064c4d>] ? kthread+0x79/0x81
[ 3708.616011]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[ 3708.675317]  [<ffffffff81064bd4>] ? kthread+0x0/0x81
[ 3708.730683]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20

[ 3708.784232] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 98 56 fd ff 48 8b 4c 24 18 4c 8b 4f

[ 3708.990151] RIP  [<ffffffff810e730b>] kfree+0x55/0xcb
[ 3709.047553]  RSP <ffff8805851c7e00>
[ 3709.095349] ---[ end trace fec09b541df2db86 ]---

[ 3709.158246] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)

I now have serial console logging enabled on these servers so I can provide a fuller copy of the trace if required, although I'm guessing the only useful output is that pasted above.

> Did this machine work well with other kernels before (and if so,
> which ones)?

The machine is new and so we haven't tried older kernels. We have tried the current bpo kernel and also experienced lock ups there, although we didn't have remote/serial logging enabled at the time. I can retest and capture the logs if that would be useful.

> If you get a chance to run memtest86+, that would also be useful, of
> course.

We have 5 of these blades, all identical. I memtest86+'d them on arrival a couple of weeks ago, everything was clean. I'll retest tonight though, just to be on the safe side. I'll also repeat earlier tests on one of the other blades to capture a trace (we've seen lock ups on the other blades too but again, didn't have remote/serial logging enabled at the time).


Thanks, Paul.

--
Paul Elliott, UNIX Systems Administrator
York Neuroimaging Centre, University of York
