On Thu, 2011-03-31 at 12:30 +0200, Giuseppe Lavagetto wrote:
> Hi, posting here since I am evidently reproducing this bug.
>
> Under load (relatively mild anyway) a 24-core X5660, 24GB RAM Dell
> Poweredge 710 gets stuck with 100% cpu usage (meaning one core gets
> stuck in running kswapd). The peculiarity of the situation is that NO
> swap is being allocated, si and so columns in vmstat output show no swap
> usage, and swap was correctly mounted. Also, it was not running
> completely out of RAM.
>
> The machine eventually freezed so I was not able to get any information
> apart from the kernel stack trace, which I post at the end of the
> report.
>
> This issue seems to be a known bug in the linux kernel, and as far as I
> understand a patch is available (and already included in RH kernels):
>
> http://kerneltrap.org/mailarchive/linux-kernel/2010/10/27/4637977

I really don't think that deals with the same bug you are seeing.

> I'll try to reproduce the problem, in the meantime do you think the
> solution Mel proposed could be ported back to the stable kernel?

Perhaps, if Mel or one of the upstream developers does it. I don't
believe anyone in the Debian kernel team is sufficiently familiar with
the VMM to backport this significant change.

> Kernel stack trace (excerpt) is attached.
>
> Best,
> Giuseppe

> application log attachment (kernel.log)
> [86613.384580] Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd ixgbe dca lru_cache cn ipmi_si mpt2sas scsi_transport_sas mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu bonding ext3 jbd mbcache loop snd_pcm dcdbas joydev power_meter processor button psmouse snd_timer serio_raw snd soundcore snd_page_alloc evdev pcspkr xfs exportfs sg sr_mod cdrom ata_generic usbhid hid uhci_hcd sd_mod ses crc_t10dif enclosure thermal ehci_hcd ata_piix usbcore libata megaraid_sas nls_base scsi_mod bnx2 thermal_sys [last unloaded: drbd]
> [86613.384610] CPU 2:
> [86613.384611] Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd ixgbe dca lru_cache cn ipmi_si mpt2sas scsi_transport_sas mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu bonding ext3 jbd mbcache loop snd_pcm dcdbas joydev power_meter processor button psmouse snd_timer serio_raw snd soundcore snd_page_alloc evdev pcspkr xfs exportfs sg sr_mod cdrom ata_generic usbhid hid uhci_hcd sd_mod ses crc_t10dif enclosure thermal ehci_hcd ata_piix usbcore libata megaraid_sas nls_base scsi_mod bnx2 thermal_sys [last unloaded: drbd]
> [86613.384635] Pid: 207, comm: kswapd0 Not tainted 2.6.32-5-amd64 #1 PowerEdge R710
> [86613.384636] RIP: 0010:[<ffffffff810b3f19>] [<ffffffff810b3f19>] find_get_pages+0x5f/0xbb
> [86613.384645] RSP: 0018:ffff88062c869bc0 EFLAGS: 00000293
> [86613.384646] RAX: ffffffffffffffff RBX: ffff88062c869c50 RCX: 0000000000000000
> [86613.384648] RDX: 0000000000000040 RSI: ffffea0002bc56e0 RDI: ffffea0002bc56d8
> [86613.384649] RBP: ffffffff8101166e R08: ffff88062c869b80 R09: 0000000000000002
> [86613.384651] R10: 0000000000000040 R11: ffff880093d74ad8 R12: 0000000000000005
> [86613.384653] R13: 0000000000000286 R14: ffff88000000b100 R15: ffff88000000c780
> [86613.384655] FS: 0000000000000000(0000) GS:ffff88033ac20000(0000) knlGS:0000000000000000
> [86613.384656] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [86613.384658] CR2: 00007fffd81dd038 CR3: 0000000001001000 CR4: 00000000000006e0
> [86613.384659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [86613.384661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [86613.384663] Call Trace:
> [86613.384668] [<ffffffff810bc034>] ? pagevec_lookup+0x17/0x1e
> [86613.384671] [<ffffffff810bcdf1>] ? invalidate_mapping_pages+0xb9/0xdb
> [86613.384675] [<ffffffff81100573>] ? shrink_icache_memory+0xfc/0x228
> [86613.384678] [<ffffffff810bf3f5>] ? shrink_slab+0xe0/0x153
> [86613.384680] [<ffffffff810bfc98>] ? kswapd+0x4d9/0x686
> [86613.384683] [<ffffffff810bd30f>] ? isolate_pages_global+0x0/0x20f
> [86613.384687] [<ffffffff81064e96>] ? autoremove_wake_function+0x0/0x2e
> [86613.384691] [<ffffffff8103aa56>] ? __wake_up_common+0x44/0x72
> [86613.384693] [<ffffffff810bf7bf>] ? kswapd+0x0/0x686
> [86613.384695] [<ffffffff81064bc9>] ? kthread+0x79/0x81
> [86613.384700] [<ffffffff81011baa>] ? child_rip+0xa/0x20
> [86613.384702] [<ffffffff81064b50>] ? kthread+0x0/0x81
> [86613.384703] [<ffffffff81011ba0>] ? child_rip+0x0/0x20

Please provide the full log messages for this error. If the messages
seem to be produced continually then send all messages produced in 1
second (the numbers on the left are time in seconds).

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
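For anyone collecting the one-second window of messages requested above, here is a minimal Python sketch of one way to cut it out of a saved log. The function name `one_second_window` and its `start` parameter are made up for illustration. It assumes each message carries a bracketed `[seconds.fraction]` stamp like the ones in the trace, and it silently drops lines without such a stamp, so check the result against the raw log before sending it.

```python
import re
from decimal import Decimal

# Matches the bracketed "[seconds.fraction]" stamp at the left of each kernel
# message, e.g. "[86613.384580]". Leading spaces inside the brackets are allowed.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def one_second_window(lines, start=None):
    """Return the lines stamped within [start, start + 1) seconds.

    If start is None, the first timestamp found becomes the start.
    Lines without a recognisable timestamp are skipped (an assumption of
    this sketch, not something the kernel guarantees is safe to drop).
    """
    selected = []
    for line in lines:
        match = STAMP.search(line)
        if match is None:
            continue
        # Decimal keeps the microsecond digits exact, unlike float.
        stamp = Decimal(match.group(1))
        if start is None:
            start = stamp
        if start <= stamp < start + 1:
            selected.append(line)
    return selected
```

Passing the lines of a saved kernel.log to `one_second_window` returns every stamped message from the first timestamp up to, but not including, one second later. Pass an explicit `start` (as a `Decimal`) to pick a different window.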