monitoring load average
I am setting up NetSaint monitoring for a medium-size network.
One problem is determining a suitable way of monitoring system load. A
machine whose server processes use 100% of some resource will have request
queues that grow indefinitely (and performance will suck).
So the load average on its own doesn't seem particularly useful. If a machine
has a sustained load average of 3.0 from CPU operations and it has two CPUs
then that indicates a problem. If the load is from disk operations and the
machine has four disks in a RAID-5 array then 3.0 equals the number of
non-parity stripes and the load is probably at the limit of what the array can
handle. If it's half from CPU and half from disk then it shouldn't be a
problem at all.
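To illustrate why raw load average is ambiguous, here is a minimal sketch of
the normalisation I have in mind; the helper names and the per-CPU threshold
idea are my own assumptions, not anything NetSaint provides:

```python
import os

def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg line into floats."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])

def load_per_cpu(one_minute_load, cpu_count):
    """Normalise load by CPU count; > 1.0 suggests a CPU-bound backlog,
    but (as above) says nothing if the load comes from disk instead."""
    return one_minute_load / cpu_count

if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        load1, load5, load15 = parse_loadavg(f.read())
    print(load_per_cpu(load1, os.cpu_count() or 1))
```

A sustained 3.0 on a two-CPU box gives 1.5 here, but the same 3.0 spread
across a four-disk RAID-5 array would look fine, which is exactly the problem.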
I think a better way would be to have one test measure the amount of CPU time
used (the sum of the "user" and "system" percentages of CPU usage as reported
by top would do; nice time doesn't matter).
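A rough sketch of that CPU test, computed from two samples of the aggregate
"cpu" line in /proc/stat rather than by scraping top (the function names are
mine; the field order user/nice/system/idle is the documented /proc/stat
layout):

```python
def read_cpu_sample(stat_text):
    """Extract the counters from the aggregate "cpu" line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return tuple(int(v) for v in line.split()[1:])
    raise ValueError("no aggregate cpu line found")

def cpu_busy_percent(prev, curr):
    """Percentage of time spent in user+system between two samples.

    Each sample is the counter tuple (user, nice, system, idle, ...).
    Nice time is deliberately left out of the busy count, matching the
    reasoning above; it still counts toward total elapsed time.
    """
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    busy = deltas[0] + deltas[2]  # user + system, skipping nice
    return 100.0 * busy / total
```

Sampling twice a few seconds apart and feeding both tuples to
cpu_busy_percent() gives the same figure top shows for user+system.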
Then I could have another test measure disk utilization in terms of the
await, svctm, or %util fields reported by iostat's extended (-x) output.
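A sketch of the disk test, parsing %util out of `iostat -x` output; I am
assuming %util is the last column, which holds for the sysstat versions I
have seen, but the column layout does vary between versions, so check the
local header line:

```python
def parse_iostat_util(extended_output):
    """Map device name -> %util from `iostat -x` extended output.

    Assumes %util is the final column of each device line; verify
    against your sysstat version's header before trusting it.
    """
    utils = {}
    in_device_section = False
    for line in extended_output.splitlines():
        fields = line.split()
        if not fields:
            in_device_section = False
            continue
        if fields[0] in ("Device:", "Device"):
            in_device_section = True
            continue
        if in_device_section:
            utils[fields[0]] = float(fields[-1])
    return utils
```

A device sitting near 100 %util is saturated regardless of what the load
average says, which is the whole point of splitting the tests this way.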
http://www.coker.com.au/selinux/ My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/ My home page