
Bug#597489: Kswapd hanging: patch available from lkml



On Thu, 2011-03-31 at 12:30 +0200, Giuseppe Lavagetto wrote:
> Hi, posting here since I am evidently reproducing this bug.
> 
> Under load (relatively mild anyway) a 24-core X5660, 24GB RAM Dell
> PowerEdge R710 gets stuck with 100% CPU usage (meaning one core gets
> stuck running kswapd). The peculiarity of the situation is that NO
> swap is being used: the si and so columns in vmstat output show no swap
> activity, and swap was correctly enabled. Also, the machine was not
> running completely out of RAM.
> 
> The machine eventually froze so I was not able to get any information
> apart from the kernel stack trace, which I post at the end of the
> report. 
> 
> This issue seems to be a known bug in the Linux kernel, and as far as I
> understand a patch is available (and already included in RH kernels):
> 
>  http://kerneltrap.org/mailarchive/linux-kernel/2010/10/27/4637977

I really don't think that deals with the same bug you are seeing.

> I'll try to reproduce the problem, in the meantime do you think the
> solution Mel proposed could be ported back to the stable kernel?

Perhaps, if Mel or one of the upstream developers does it.  I don't
believe anyone in the Debian kernel team is sufficiently familiar with
the VMM to backport this significant change.

> Kernel stack trace (excerpt) is attached.
> 
> Best,
> Giuseppe
> application log attachment (kernel.log)
> [86613.384580] Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd ixgbe dca lru_cache cn ipmi_si mpt2sas scsi_transport_sas mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu bonding ext3 jbd mbcache loop snd_pcm dcdbas joydev power_meter processor button psmouse snd_timer serio_raw snd soundcore snd_page_alloc evdev pcspkr xfs exportfs sg sr_mod cdrom ata_generic usbhid hid uhci_hcd sd_mod ses crc_t10dif enclosure thermal ehci_hcd ata_piix usbcore libata megaraid_sas nls_base scsi_mod bnx2 thermal_sys [last unloaded: drbd]
> [86613.384610] CPU 2:
> [86613.384611] Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd ixgbe dca lru_cache cn ipmi_si mpt2sas scsi_transport_sas mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu bonding ext3 jbd mbcache loop snd_pcm dcdbas joydev power_meter processor button psmouse snd_timer serio_raw snd soundcore snd_page_alloc evdev pcspkr xfs exportfs sg sr_mod cdrom ata_generic usbhid hid uhci_hcd sd_mod ses crc_t10dif enclosure thermal ehci_hcd ata_piix usbcore libata megaraid_sas nls_base scsi_mod bnx2 thermal_sys [last unloaded: drbd]
> [86613.384635] Pid: 207, comm: kswapd0 Not tainted 2.6.32-5-amd64 #1 PowerEdge R710
> [86613.384636] RIP: 0010:[<ffffffff810b3f19>]  [<ffffffff810b3f19>] find_get_pages+0x5f/0xbb
> [86613.384645] RSP: 0018:ffff88062c869bc0  EFLAGS: 00000293
> [86613.384646] RAX: ffffffffffffffff RBX: ffff88062c869c50 RCX: 0000000000000000
> [86613.384648] RDX: 0000000000000040 RSI: ffffea0002bc56e0 RDI: ffffea0002bc56d8
> [86613.384649] RBP: ffffffff8101166e R08: ffff88062c869b80 R09: 0000000000000002
> [86613.384651] R10: 0000000000000040 R11: ffff880093d74ad8 R12: 0000000000000005
> [86613.384653] R13: 0000000000000286 R14: ffff88000000b100 R15: ffff88000000c780
> [86613.384655] FS:  0000000000000000(0000) GS:ffff88033ac20000(0000) knlGS:0000000000000000
> [86613.384656] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [86613.384658] CR2: 00007fffd81dd038 CR3: 0000000001001000 CR4: 00000000000006e0
> [86613.384659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [86613.384661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [86613.384663] Call Trace:
> [86613.384668]  [<ffffffff810bc034>] ? pagevec_lookup+0x17/0x1e
> [86613.384671]  [<ffffffff810bcdf1>] ? invalidate_mapping_pages+0xb9/0xdb
> [86613.384675]  [<ffffffff81100573>] ? shrink_icache_memory+0xfc/0x228
> [86613.384678]  [<ffffffff810bf3f5>] ? shrink_slab+0xe0/0x153
> [86613.384680]  [<ffffffff810bfc98>] ? kswapd+0x4d9/0x686
> [86613.384683]  [<ffffffff810bd30f>] ? isolate_pages_global+0x0/0x20f
> [86613.384687]  [<ffffffff81064e96>] ? autoremove_wake_function+0x0/0x2e
> [86613.384691]  [<ffffffff8103aa56>] ? __wake_up_common+0x44/0x72
> [86613.384693]  [<ffffffff810bf7bf>] ? kswapd+0x0/0x686
> [86613.384695]  [<ffffffff81064bc9>] ? kthread+0x79/0x81
> [86613.384700]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
> [86613.384702]  [<ffffffff81064b50>] ? kthread+0x0/0x81
> [86613.384703]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20

Please provide the full log messages for this error.  If the messages
seem to be produced continually then send all messages produced in 1
second (the numbers on the left are time in seconds).
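
A minimal sketch of one way to cut such a one-second window out of a
saved log, assuming every message carries the bracketed timestamp prefix
seen in the excerpt above. The function name and the regular expression
are illustrative only and not part of any existing tool:

```python
import re

# Matches the "[seconds.microseconds]" prefix on each kernel message.
# search() is used below, so a leading "> " quote marker is tolerated.
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def one_second_window(lines, start=None):
    """Return the lines whose timestamp lies in [start, start + 1.0).

    If start is None, the first timestamp found is taken as the start.
    Lines without a timestamp are skipped.
    """
    selected = []
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        stamp = float(match.group(1))
        if start is None:
            start = stamp
        if start <= stamp < start + 1.0:
            selected.append(line)
    return selected
```

Reading the saved log with open("kernel.log").readlines() and passing the
result to one_second_window() would give the messages to send; pass
start explicitly to pick a window other than the first one.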

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
