
Help with scheduler problems from #516374?



The scheduler problems in lenny's kernel (as evidenced by "task * blocked
for more than 120 seconds") seem to be hitting a number of people and a
variety of workloads:

#516374: INFO: task * blocked for more than 120 seconds. (ubuntu bug #276476)
#517449: linux-image-2.6.26-1-amd64: SCHED_IDLE issues (tasks blocked for more than 120 seconds)
#517586: "INFO: task * blocked for more than 120 seconds" causes system freeze
#499745: linux-image-2.6.26-1-xen-686: freezes under Xen 3.2.0

So far I've seen this mainly on machines running several KVM VMs, but now
that I'm looking for it, I've noticed it in other cases as well.

For example, on a 2x2.4GHz Xeon machine with 2GB of RAM running a moderately
loaded OpenLDAP slapd (very little disk I/O, ~65% of its memory and 40-50%
CPU used):

[386715.749526] INFO: task cron:1070 blocked for more than 120 seconds.
[386715.749579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[386715.749628] cron          D 00000000     0  1070   3130
[386715.749635]        f7ed2140 00200086 00000000 00000000 c0354e40 f7ed22cc c2011fa0 00000000 
[386715.749648]        00000003 041b8f25 00000000 c011cb0f 00000000 00000000 00000000 000000ff 
[386715.749658]        7fffffff 7fffffff c3c63f68 00000002 c02b8519 f762b340 ffffffff f7566688 
[386715.749670] Call Trace:
[386715.749703]  [<c011cb0f>] sched_balance_self+0x1ce/0x227
[386715.749726]  [<c02b8519>] schedule_timeout+0x13/0x86
[386715.749749]  [<c02b7c3d>] wait_for_common+0xaf/0x10f
[386715.749759]  [<c011b682>] default_wake_function+0x0/0x8
[386715.749774]  [<c0121b89>] do_fork+0x17f/0x1dc
[386715.749792]  [<c0102173>] sys_vfork+0x18/0x1c
[386715.749801]  [<c01038ce>] syscall_call+0x7/0xb
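
For anyone who wants to check whether their own machines are affected, here
is a rough sketch of a script that tallies these reports by task name. It is
not taken from any of the bug reports above: the script name hung_tasks.py
and the function count_hung_tasks are made up for illustration, and the
regular expression assumes the message format shown in the trace above.

```python
import re
import sys
from collections import Counter

# Matches the hung-task report line, e.g.
# "[386715.749526] INFO: task cron:1070 blocked for more than 120 seconds."
# The greedy name group keeps task names that themselves contain colons intact.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(lines):
    """Return a Counter mapping task name -> number of hung-task reports."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from stdin and prints one line per task name.
    for name, n in count_hung_tasks(sys.stdin).most_common():
        print(f"{n:6d}  {name}")
```

Assuming the script is saved as hung_tasks.py, something like
"dmesg | python3 hung_tasks.py" would print a per-task count of the reports
still in the kernel ring buffer.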

I'm currently running 2.6.28 (from sid as of ~4 weeks ago) with the three
patches mentioned in LP#276476 applied. That combination has taken our
heavily loaded KVM hosts from locking up every 3-6 days to completely
stable. I looked at backporting the patches in question to lenny's 2.6.26,
but they don't apply cleanly and I don't know enough about the Linux
scheduler to be confident doing it myself.

Given that this bug seems to be affecting a number of people in substantial
ways, could these changes be backported to 2.6.26, perhaps with an upload to
proposed-updates? Even if a lenny update to fix this problem isn't in the
cards, would someone with more kernel knowledge be willing to help me
fix 2.6.26? I'm willing to provide any testing or other assistance; I just
don't have the specialized knowledge to make this fix in 2.6.26.

john
-- 
John Morrissey          _o            /\         ----  __o
jwm@horde.net        _-< \_          /  \       ----  <  \,
www.horde.net/    __(_)/_(_)________/    \_______(_) /_(_)__

