
Re: high load average



on Fri, Mar 09, 2001 at 01:27:50AM -0600, Dave Sherohman (esper@sherohman.org) wrote:
> On Thu, Mar 08, 2001 at 10:55:10PM -0800, kmself@ix.netcom.com wrote:
> > on Tue, Mar 06, 2001 at 11:21:07AM -0600, Dave Sherohman (esper@sherohman.org) wrote:
> > > You have the notation correct, but load average and CPU
> > > utilization are not directly related.  Load average is the average
> > > number of processes that are waiting on system resources over a
> > > certain time period; they could be waiting for CPU, for I/O, or
> > > for other resources.
> 
> > It *is* CPU.  These are processes in the run queue.  A process
> > blocked for I/O or another resource is blocked, not runnable
> 
> OK, now I'm confused...

I'm also somewhat fallible.  So, we'll go to the source on the question
this time.

In particular, a job blocked on disk I/O *is* counted as runnable.  My
error.

> My statements were based on my memory of a thread from last May (was
> it that long ago?) on this very list titled "(ot) What is load
> average?".  Checking back on the messages I saved from that
> conversation, I see one from kmself@ix.netcom.com stating that load
> average is
> 
> | Number of processes in the run queue, averaged over time.  Often
> | confused with CPU utilization, which it is not.
> 
> Load average either is CPU or it isn't, right?  

Percent of clock ticks being utilized is CPU utilization.  Number of
jobs in runnable state is load average.  Related, but not identical
metrics.

My own statement:

    Load average is a measure of _average current requests for CPU
    processing_ over some time interval.

While we're at it, let's pull in a more authoritative definition, this
from _System Performance Tuning_, by Mike Loukides, O'Reilly, 1990:

    The _system load average_ provides a convenient way to summarize the
    activity on a system.  It is the first statistic you should look at
    when performance seems to be poor.  UNIX defines load average as the
    average number of processes in the kernel's run queue during an
    interval.  A _process_ is a single stream of instructions.  Most
    programs run as a single process, but some spawn (UNIX terminology:
    _fork_) other processes as they run.  A process is in the run queue
    if it is:

      * Not waiting for any external event (e.g., not waiting for
        someone to type a character at a terminal).
      
      * Not waiting of its own accord (e.g., the job hasn't called 'wait'.)

      * Not stopped (e.g., the job hasn't been stopped by CTRL-Z).
        Processes cannot be stopped on XENIX and versions of System V.2.
        The ability to stop processes has been added to System V.4 and
        some versions of V.3.

    While the load average is convenient, it may not give you an
    accurate picture of the system's load.  There are two primary
    reasons for this inaccuracy:

      * The load average counts as runnable all jobs waiting for disk
        I/O.  This includes processes that are waiting for disk
        operations to complete across NFS.  If an NFS server is not
        responding (e.g., if the network is faulty or the server has
        crashed), a process can wait for hours for an NFS operation to
        complete.  It is considered runnable the entire time even though
        nothing is happening; therefore, the load average climbs when
        NFS servers crash, even though the system isn't really doing any
        more work.

      * The load average does not account for scheduling priority.  It
        does not differentiate between jobs that have been niced (i.e.,
        placed at a lower priority and therefore not consuming much CPU
        time) or jobs that are running at a high priority.

Hopefully, that clarifies a few misperceptions and sloppy statements (my
own included).

Specific to GNU/Linux, the count of active tasks is computed in
kernel/sched.c as:

    static unsigned long count_active_tasks(void)
    {
	    struct task_struct *p;
	    unsigned long nr = 0;

	    read_lock(&tasklist_lock);
	    for_each_task(p) {
		    if ((p->state == TASK_RUNNING ||
			 p->state == TASK_UNINTERRUPTIBLE ||
			 p->state == TASK_SWAPPING))
			    nr += FIXED_1;
	    }
	    read_unlock(&tasklist_lock);
	    return nr;
    }
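
For illustration, here is a minimal sketch of how a count like the one
above is folded into the 1-, 5- and 15-minute figures.  The constants and
the update step mirror the FSHIFT, FIXED_1, EXP_* and CALC_LOAD
definitions as I recall them from the 2.4-era include/linux/sched.h, with
one sample taken every 5 seconds.  The simulate() helper and the function
names are my own, not the kernel's, so treat this as a sketch rather than
the kernel source.

```c
#include <assert.h>
#include <stdio.h>

/* Fixed-point constants, as recalled from 2.4-era include/linux/sched.h. */
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed point */
#define EXP_1    1884UL             /* 1/exp(5sec/1min) in fixed point */
#define EXP_5    2014UL             /* 1/exp(5sec/5min) */
#define EXP_15   2037UL             /* 1/exp(5sec/15min) */

/* One update step: load = load*exp + active*(1 - exp), in fixed point.
 * 'active' is already scaled by FIXED_1, as count_active_tasks() returns. */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
    assert(exp < FIXED_1);
    load *= exp;
    load += active * (FIXED_1 - exp);
    return load >> FSHIFT;
}

/* Hypothetical helper (not in the kernel): apply 'steps' updates while a
 * constant number of tasks stays in a counted state. */
static unsigned long simulate(unsigned long load, unsigned long exp,
                              unsigned long ntasks, int steps)
{
    unsigned long active = ntasks * FIXED_1;
    for (int i = 0; i < steps; i++)
        load = calc_load(load, exp, active);
    return load;
}

/* Integer and two-digit fractional parts of a fixed-point load value. */
static unsigned long load_int(unsigned long x)
{
    return x >> FSHIFT;
}

static unsigned long load_frac(unsigned long x)
{
    return load_int((x & (FIXED_1 - 1)) * 100);
}

/* Print a fixed-point load value in the familiar "N.NN" form. */
static void print_load(unsigned long x)
{
    printf("%lu.%02lu\n", load_int(x), load_frac(x));
}
```

Calling print_load(simulate(0, EXP_1, 2, 12)) shows where the 1-minute
figure stands after one minute with two tasks stuck in a counted state.
The figure is an exponentially damped average, so it approaches 2 over
several minutes instead of jumping there.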

> So you can't have been correct both times.  

No, I am.  You're just not reading me consistently ;-)

My admonition in the current thread that load average is a CPU metric
is just that:  load average is concerned with CPU, it is *not* concerned
with memory, disk I/O (though I/O blocking can affect it), etc.  However,
as I clarify in this current post, and in the prior thread, load average
is not equivalent to CPU _utilization_.

To put it in different terms:

   - Load average is how often you're asking for it.
   - CPU utilization is how often you're getting it.

High load average means you've got more requests than you can handle.
But the actual efficiency of servicing of requests is not considered.

High utilization means you're processing a large number of requests --
but the number of requests in queue is not considered.
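
To make the distinction concrete, here is a toy sketch that computes
both numbers from the same set of snapshots of a single-CPU machine.
The state names, the struct and the measure() function are all
hypothetical, and it uses a plain arithmetic mean rather than the
kernel's damped average.  Its only purpose is to show how the two
metrics can diverge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task states for this sketch (not the kernel's names). */
enum task_state { ON_CPU, WAITING_FOR_CPU, DISK_WAIT, SLEEPING };

struct metrics {
    double load_average;     /* mean number of counted tasks per snapshot */
    double cpu_utilization;  /* fraction of snapshots with the CPU busy */
};

/* 'snap' holds nsnaps snapshots of ntasks states each, stored row by row:
 * snap[s * ntasks + t] is the state of task t in snapshot s. */
static struct metrics measure(const enum task_state *snap,
                              size_t ntasks, size_t nsnaps)
{
    assert(snap != NULL && nsnaps > 0);
    size_t demand = 0, busy = 0;

    for (size_t s = 0; s < nsnaps; s++) {
        int cpu_busy = 0;
        for (size_t t = 0; t < ntasks; t++) {
            enum task_state st = snap[s * ntasks + t];
            if (st != SLEEPING)
                demand++;        /* counted toward load, disk wait included */
            if (st == ON_CPU)
                cpu_busy = 1;    /* the single CPU did work this snapshot */
        }
        busy += cpu_busy;
    }

    struct metrics m = { (double)demand / nsnaps, (double)busy / nsnaps };
    return m;
}
```

Two tasks that sit in DISK_WAIT for every snapshot give a load average
of 2 and a utilization of 0, which is the situation the original
question in this thread described.  One task on the CPU with two more
queued behind it gives a load of 3 at 100% utilization.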

> Now, you may have been wrong last year and since realized that it's
> more CPU-related than you had thought, but (aside from this thread's
> original question describing a situation with a long-term consistent
> load average of 2.00 and low-to-no CPU utilization) last May's thread
> also included a message from jjlupa@jamdata.net stating that
> 
> ] It is the average number of processes in the 'R' (running/runnable) state
> ] (or blocked on I/O).
> 
> and
> 
> ] The load average is most directly related to CPU.  Two CPU-intensive
> ] processes running will result in a load average of 2, etc.  But I/O
> ] intensive processes spend so much time active that they can drive up
> ] the load average also.  In addition if more than one process is
> ] blocked on I/O then the load average will go up very quickly, as
> ] both processes count toward the load even if only one can access the
> ] disk at a time.
> 
> Based on my observations of load and CPU readings on my boxes and the
> messages from last May that I quoted above, I'm inclined to maintain
> my earlier statement that processes waiting on any resource (not just
> CPU) contribute to load.  But, if that's not the case, I'm willing to
> be corrected.

The clarification is given in the O'Reilly citation.  Load average
counts runnable processes: those not waiting on other resources, with
the exception that processes blocked on disk I/O are counted too.
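
If you want to see which processes are contributing on a GNU/Linux box,
one approach is to look at the state letter in /proc/<pid>/stat: 'R' is
running/runnable and 'D' is uninterruptible (usually disk) wait.  The
sketch below assumes /proc is mounted and that the state letter follows
the parenthesized command name.  The function names are mine, and it
walks only the top-level process entries, so it is a rough
approximation of what the kernel counts, not a replacement for it.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Extract the state letter from one /proc/<pid>/stat line.  The command
 * name is in parentheses and may itself contain ')' or spaces, so search
 * for the last ')'.  Returns '\0' if the line is malformed. */
static char stat_state(const char *line)
{
    assert(line != NULL);
    const char *p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '\0';
    return p[2];
}

/* 'R' (running/runnable) and 'D' (uninterruptible wait) both count. */
static int counts_toward_load(char state)
{
    return state == 'R' || state == 'D';
}

/* Count processes currently in R or D state.  Returns -1 if /proc cannot
 * be opened.  Processes that exit mid-scan are silently skipped. */
static int count_load_tasks(void)
{
    DIR *dir = opendir("/proc");
    if (dir == NULL)
        return -1;

    int n = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                      /* not a PID directory */

        char path[300], line[512];
        snprintf(path, sizeof path, "/proc/%s/stat", de->d_name);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;
        if (fgets(line, sizeof line, f) != NULL &&
            counts_toward_load(stat_state(line)))
            n++;
        fclose(f);
    }
    closedir(dir);
    return n;
}
```

A machine reporting a high load with an idle CPU will typically show a
handful of processes stuck in 'D', as in the hung-NFS case from the
O'Reilly excerpt.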

-- 
Karsten M. Self <kmself@ix.netcom.com>    http://kmself.home.netcom.com/
 What part of "Gestalt" don't you understand?       There is no K5 cabal
  http://gestalt-system.sourceforge.net/         http://www.kuro5hin.org
