
Re: Help with scheduler problems from #516374 ?



title INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
thanks

On Fri, Apr 10, 2009 at 10:18:00AM -0400, John Morrissey wrote:
> The scheduler problems in lenny's kernel (as evidenced by "task * blocked
> for more than 120 seconds") seem to be hitting a number of people and a
> variety of workloads:
> 
> #516374: INFO: task * blocked for more than 120 seconds. (ubuntu bug #276476)
> #517449: linux-image-2.6.26-1-amd64: SCHED_IDLE issues (tasks blocked for more than 120 seconds)
> #517586: "INFO: task * blocked for more than 120 seconds" causes system freeze
> #499745: linux-image-2.6.26-1-xen-686: freezes under Xen 3.2.0
> 
> Until now, I've experienced this primarily on machines running several KVM
> VMs, but have noticed it in other cases now that I've been looking for it.
[snip]
> [386715.749526] INFO: task cron:1070 blocked for more than 120 seconds.
[snip]
> I'm currently running 2.6.28 (from sid as of ~4 weeks ago) with the three
> patches mentioned in LP#276476, which has taken our heavily loaded KVM hosts
> from locking up every 3-6 days to completely stable. I looked at backporting
> the patches in question to lenny's 2.6.26, but they don't apply cleanly and
> I don't know enough about the Linux scheduler to be confident in doing it
> myself.

The SCHED_IDLE patches from LP#276476 seem to have been a red herring in
this case. In retrospect, I should have figured as much since none of the
involved tasks (kvm(1), cron(8), etc.) is SCHED_IDLE.

I've been running 2.6.28 from sid (without the SCHED_IDLE patches) on a KVM
host for about two weeks now with no instability, so it seems that some
scheduler change(s) between the 2.6.26 shipped with lenny and 2.6.28 have
fixed this problem.

I'd bisect, but reproducing this takes at least a couple of days per
iteration and I don't have a specific test case that is guaranteed to
reproduce it, only some general workloads that have a reasonable
probability of triggering the problem given a few days.
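
As a rough illustration of how each iteration could be checked for the
symptom, here is a minimal Python sketch that scans kernel log text for
the "blocked for more than N seconds" message quoted above. The function
name find_hung_tasks and the regular expression are assumptions made for
this example, based only on the one log line shown in this thread; other
kernels may format the message differently.

```python
import re

# Matches lines shaped like the one quoted in this thread:
# [386715.749526] INFO: task cron:1070 blocked for more than 120 seconds.
# The pattern is an assumption derived from that single sample line.
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (timestamp, command, pid, seconds) tuples,
    one per hung-task report found in log_text."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            hits.append(
                (float(m.group("ts")), m.group("comm"),
                 int(m.group("pid")), int(m.group("secs")))
            )
    return hits
```

Feeding it the saved output of dmesg after a few days under load would
give a simple yes/no answer for marking that kernel good or bad by hand.
An empty result only means the problem was not seen in that window, not
that the kernel is known to be good.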

Would someone please give me some advice on what to do next? I can try
bisecting if it's the best-by-far/only option, but it might take me a while
to work through it.

john
-- 
John Morrissey          _o            /\         ----  __o
jwm@horde.net        _-< \_          /  \       ----  <  \,
www.horde.net/    __(_)/_(_)________/    \_______(_) /_(_)__

