
Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100



Hi Jonathan,

On 08/08/11 16:16, Jonathan Nieder wrote:
> I assume this is fairly reproducible even after a reboot?  Is the

Correct, we can reproduce the lock ups after a reboot following 5-60 minutes of high I/O load (900MB/s plus).

> stacktrace from the first sign of trouble in dmesg always the same?

I'm no expert at reading these but I believe it is the same. Here's the trace after the next reboot/lock up cycle:

[ 3705.959849] kernel BUG at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_none/mm/slub.c:2969!
[ 3706.077621] invalid opcode: 0000 [#1] SMP
[ 3706.113947] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-3/target1:0:1/1:0:1:0/block/sdj/stat
[ 3706.235513] CPU 0
[ 3706.251928] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 dm_round_robin dm_multipath scsi_dh loop sd_mod crc_t10dif snd_pcm joydev snd_timer snd soundcore snd_page_alloc usbhid hid evdev pcspkr hpilo hpwdt psmouse power_meter container processor button serio_raw ext3 jbd mbcache dm_mod hpsa cciss uhci_hcd ehci_hcd qla2xxx usbcore scsi_transport_fc nls_base scsi_tgt scsi_mod be2net thermal thermal_sys [last unloaded: scsi_wait_scan]
[ 3706.781628] Pid: 1845, comm: ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7
[ 3706.882853] RIP: 0010:[<ffffffff810e730b>] [<ffffffff810e730b>] kfree+0x55/0xcb
[ 3706.956205] RSP: 0018:ffff8805851c7e00  EFLAGS: 00010246
[ 3707.017700] RAX: 0200000000000000 RBX: ffff88058553eed0 RCX: 0000000000000042
[ 3707.091197] RDX: ffff88058553eea0 RSI: 0000000000000041 RDI: ffffea001352a590
[ 3707.167835] RBP: ffff88058553eea0 R08: ffff880585fdc0d0 R09: 0000000000080000
[ 3707.245578] R10: 0000000000000014 R11: ffff880584a6b8b8 R12: ffffffffa023ddcf
[ 3707.319659] R13: ffff88058553eed8 R14: ffff880584a6b880 R15: ffff880584a6b880
[ 3707.393985] FS: 0000000000000000(0000) GS:ffff880015200000(0000) knlGS:0000000000000000
[ 3707.476061] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 3707.538541] CR2: 00007f4ffff0377c CR3: 000000026295b000 CR4: 00000000000006f0
[ 3707.627218] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3707.707945] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3707.788003] Process ext4-dio-unwrit (pid: 1845, threadinfo ffff8805851c6000, task ffff880584a6b880)
[ 3707.885120] Stack:
[ 3707.914054] ffff88058553eed0 ffff88058553eea0 ffff8805844b0928 ffffffffa023ddcf
[ 3707.992872] <0> ffff8805851c7ef8 ffffe8ffffa08680 ffff88058553eed0 ffffffff810618e7
[ 3708.072050] <0> 000000000000f9e0 ffff880584a6bc38 ffff880584a6b880 ffff8805851c7fd8
[ 3708.169803] Call Trace:
[ 3708.211806] [<ffffffffa023ddcf>] ? ext4_end_aio_dio_work+0x4e/0x5a [ext4]
[ 3708.285689]  [<ffffffff810618e7>] ? worker_thread+0x188/0x21d
[ 3708.340716]  [<ffffffffa023dd81>] ? ext4_end_aio_dio_work+0x0/0x5a [ext4]
[ 3708.415673]  [<ffffffff81064f1a>] ? autoremove_wake_function+0x0/0x2e
[ 3708.495456]  [<ffffffff8106175f>] ? worker_thread+0x0/0x21d
[ 3708.554553]  [<ffffffff81064c4d>] ? kthread+0x79/0x81
[ 3708.616011]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[ 3708.675317]  [<ffffffff81064bd4>] ? kthread+0x0/0x81
[ 3708.730683]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[ 3708.784232] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 98 56 fd ff 48 8b 4c 24 18 4c 8b 4f
[ 3708.990151] RIP  [<ffffffff810e730b>] kfree+0x55/0xcb
[ 3709.047553]  RSP <ffff8805851c7e00>
[ 3709.095349] ---[ end trace fec09b541df2db86 ]---
[ 3709.158246] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)

I now have serial console logging enabled on these servers, so I can provide a fuller copy of the trace if required, although I'm guessing the only useful output is what is pasted above.
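
For checking whether traces captured from separate lock ups are the same, a minimal sketch follows. It assumes the log is available as plain text with one kernel message per line; the function name `trace_signature` and the regular expression are illustrative only and not taken from any existing tool. It reduces the "Call Trace:" section to an ordered list of symbol names, dropping timestamps, addresses and offsets so that two traces can be compared directly.

```python
import re

# Matches a frame such as "[<ffffffff810618e7>] ? worker_thread+0x188/0x21d"
# and captures only the symbol name ("worker_thread").
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def trace_signature(log_text):
    """Return the symbols listed under the first 'Call Trace:' in log_text."""
    symbols = []
    in_trace = False
    for line in log_text.splitlines():
        if not in_trace:
            # Skip everything up to and including the "Call Trace:" header.
            in_trace = "Call Trace:" in line
            continue
        match = FRAME_RE.search(line)
        if match is None:
            # First non-frame line (e.g. "Code: ...") ends the trace.
            break
        symbols.append(match.group(1))
    return symbols
```

Two captures can then be compared with `trace_signature(first_log) == trace_signature(second_log)`. Module-specific addresses change between boots, which is why only the symbol names are kept.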

> Did this machine work well with other kernels before (and if so,
> which ones)?

The machine is new, so we haven't tried older kernels. We have tried the current bpo kernel and also experienced lock ups there, although we didn't have remote/serial logging enabled at the time. I can retest and capture the logs if that would be useful.

> If you get a chance to run memtest86+, that would also be useful, of
> course.

We have 5 of these blades, all identical. I memtest86+'d them on arrival a couple of weeks ago and everything was clean. I'll retest tonight though, just to be on the safe side. I'll also repeat the earlier tests on one of the other blades to capture a trace (we've seen lock ups on the other blades too but, again, didn't have remote/serial logging enabled at the time).

Thanks, Paul.

--
Paul Elliott, UNIX Systems Administrator
York Neuroimaging Centre, University of York


