Massive load average, can't log in
This morning I noticed that my Etch box wouldn't let me open a web
browser window. Odd, I thought, so I took a look at the load average,
which was sitting at 210! Larger programs such as vi or firefox will no
longer launch, but simpler ones such as ps or top still run from a
regular user terminal. A ps uaxw reveals dozens of crond processes in
the all too familiar 'D' state, each one with a similarly stalled mrtg
process. I've never even used mrtg on that machine, apart from running
apt-get install mrtg.
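
For anyone wanting to pull out just the stuck processes, here is a rough
sketch that lists everything in the 'D' state by reading /proc directly
instead of going through ps. It assumes the standard Linux
/proc/<pid>/stat layout and that python itself will still start on the
affected box, which is not a given here. The names parse_stat and
blocked_processes are made up for this example.

```python
import os


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST closing parenthesis.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return sorted (pid, command) pairs for processes in state 'D'."""
    blocked = []
    if not os.path.isdir(proc_root):
        return blocked
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

On a machine in the state described above, this should print the stalled
crond and mrtg entries, one per line.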
If I try to su - to kill some processes, that particular terminal goes
into an uninterruptible sleep and I have to switch to another one. I get
something similar when I try to log in remotely: it never gets as far as
a password prompt. CPU is at 0%, memory usage is below 50% and there is
no disk activity.
So in light of this I have two questions:
1. Why would the mrtg cron job be stalling? Is there a known problem
with this program, or is it looking for some non-existent NFS share?
2. Why can't I log in or start any new large processes? Is there some
load average threshold in Debian above which no one is allowed to log
in? The high load average here does not reflect heavy disk/CPU/memory
usage, just stalled processes, so there is plenty of computing power
available. Perhaps the load average calculation needs to be updated to
ignore processes that have been stalled for some period?
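
To illustrate the point in question 2: on Linux the load average counts
tasks in uninterruptible sleep ('D') as well as runnable ones ('R'),
which is how it can sit at 210 with the CPU idle. Below is a small
sketch, assuming the usual /proc/loadavg format; the function names and
the sample values in the usage note are invented for illustration and
are not taken from the affected machine.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    # Fourth field is "<currently runnable>/<total scheduling entities>".
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }


def load_contributors(states):
    """Count the task states that feed the Linux load average."""
    runnable = sum(1 for s in states if s == "R")
    uninterruptible = sum(1 for s in states if s == "D")
    return {
        "R": runnable,
        "D": uninterruptible,
        "sum": runnable + uninterruptible,
    }
```

For example, load_contributors(list("DDDRSS")) counts three 'D' tasks and
one 'R' task, ignoring the two sleeping ones. Feeding it the state column
from ps would show how much of the load figure comes from stalled
processes rather than from anything actually using the CPU.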