
Re: severe I/O performance issues on 2.4.22 SMP system

Hi Daniel

In my experience, SMP only works really well in some specific situations, normally where all the software and hardware are built for that particular SMP implementation.

As an example, I had a lot of problems with DPT's RAID controllers when they moved to I2O-based hardware and I started using dual Xeon processor systems.
DPT investigated quickly and found that the lock timers, which count to ever-increasing numbers while waiting for I/O, kept hitting the largest value their software counters could hold, because the server was simply too fast.

Some wait loops count to 10 and try again, then count to 100 and try again, then count to 1000 and try again, until they either succeed or the code hits its mathematical limit and stops, normally until some watchdog timer resets the circuit and it all starts again.
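
To make that escalating loop concrete, here is a minimal simulation in Python rather than real driver code. The function name, the factor of 10 and the 16-bit counter width are assumptions chosen for illustration; the actual DPT driver may well have been structured differently.

```python
def escalating_wait(resource_is_free, first_limit=10, factor=10, counter_bits=16):
    """Spin in rounds of 10, 100, 1000, ... counts, polling after each round.

    resource_is_free is a hypothetical callback that receives the number of
    rounds completed so far and returns True once the resource is available.
    Returns the number of rounds used, or None when the next limit no longer
    fits in the counter (the point where the real code would stall until a
    watchdog reset).
    """
    max_count = (1 << counter_bits) - 1  # largest value the counter can hold
    limit = first_limit
    rounds = 0
    while limit <= max_count:
        for _ in range(limit):  # the busy-wait itself: count and do nothing
            pass
        rounds += 1
        if resource_is_free(rounds):
            return rounds
        limit *= factor  # wait ten times longer on the next attempt
    return None


# A resource that frees up during the third round is caught in time.
print(escalating_wait(lambda rounds: rounds >= 3))
# One that needs a fifth round is never seen: 100000 does not fit in 16 bits.
print(escalating_wait(lambda rounds: rounds >= 5))
```

With a 16-bit counter the loop only ever gets four attempts (10, 100, 1000 and 10000 counts), however fast or slow the processor is.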

On a fast multiprocessor system, a driver written a year ago could easily fail to take into account the extremely large numbers these wait counters need to reach before the 1 nanosecond has elapsed and the resource is freed by other processes.
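
A rough back-of-the-envelope sketch of why this happens, again in Python. All of the figures are made up for illustration (a 16-bit counter, 50 versus 2400 loop iterations per microsecond, a device that needs 1 millisecond); none of them come from DPT or from measurements.

```python
def spin_duration_ns(count, iterations_per_us):
    """Wall-clock time, in nanoseconds, that `count` empty iterations last."""
    return count * 1000 // iterations_per_us


def count_needed(wait_ns, iterations_per_us):
    """Iterations required to cover wait_ns, rounded up to a whole count."""
    return (wait_ns * iterations_per_us + 999) // 1000


COUNTER_MAX = (1 << 16) - 1  # assumed 16-bit counter: 65535

# Longest wait the counter can express on an older and a newer machine.
old_max_wait = spin_duration_ns(COUNTER_MAX, 50)    # 1310700 ns, about 1.3 ms
new_max_wait = spin_duration_ns(COUNTER_MAX, 2400)  # 27306 ns, about 27 us

# Counts needed to wait 1 ms for the device on each machine.
old_needed = count_needed(1_000_000, 50)    # 50000, fits in the counter
new_needed = count_needed(1_000_000, 2400)  # 2400000, does not fit

print(old_needed <= COUNTER_MAX, new_needed <= COUNTER_MAX)
```

The same counter that comfortably covers the wait on the slower machine falls well short on the faster one, even though the driver code has not changed.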

Also, to allow other devices to access resources, drivers may have a wait loop that counts to, say, 1000 before re-accessing the device. Again, if each separate processor is using the same resource and this count finishes so fast that the resource has not yet been released, the resource will appear to be continually in use, when in fact the wait loops are just too fast and the maximum number of tries is too small.
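
For contrast, a wait that is bounded by elapsed time instead of by a count behaves the same on a slow or a fast processor. The sketch below is a generic illustration in Python, not what any of the drivers mentioned here actually do; `wait_for`, its timeout and its poll interval are all hypothetical.

```python
import time


def wait_for(resource_is_free, timeout_s=0.05, poll_interval_s=0.001):
    """Poll until the resource is free or timeout_s of real time has passed.

    The deadline comes from a monotonic clock, so the length of the wait
    does not depend on how quickly the processor can run an empty loop.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if resource_is_free():
            return True
        time.sleep(poll_interval_s)  # give other users of the resource a turn
    return resource_is_free()  # one last check once the deadline has passed


# Hypothetical resource that is released 10 ms from now.
released_at = time.monotonic() + 0.01
print(wait_for(lambda: time.monotonic() >= released_at))
```

The trade-off is that the code now depends on a trustworthy clock, which a count-based loop never needed.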

I'm not sure about your specific situation, but the symptoms you describe are identical to the issues I had. These days I tend not to use SMP much, as most SMP systems are hybrids, with anomalies that waste too much time to solve. I tend to just run more servers for the same application, which also provides some form of redundancy. If I run two fast mail servers rather than one huge SMP system, I only lose half my email when a component or upgrade fails. This also tends to be cheaper and easier to get going and maintain.

Hope this helps.

Best regards
Glenn Hocking
Publish Media Pty Ltd

Daniel Erat wrote:

I've been experiencing some serious performance issues that appear to be
I/O-related with a dual-Xeon 2.4.22 mail server.

The server has a SuperMicro Super P4DP6 motherboard with dual Xeon
2.4 GHz processors and 4 GB of RAM.  One of the two onboard Adaptec 7899P
SCSI chipsets is being used to control a disk that has the OS, and a
QLogic QLA2200 fibre channel PCI card is being used to connect an
external array.  A two-gigabyte partition on the disk is devoted to
swap.  The server is running Debian 3.0 with a hand-compiled 2.4.22
kernel.

Once all of the RAM is being used to cache stuff, sudden spikes in
activity seem to cause the machine to grind to a halt for anywhere
between thirty seconds and twenty minutes.  I'm not sure how much of
this is a result of a bunch of SMTP, POP3, and MySQL processes piling up
and waiting for the disk, but I am able to trigger the problem at will
by running the "find" command on the root filesystem or the external
array.  Swap usage never goes beyond a couple of megabytes.  When "ps"
finally finishes running while the machine is in this state, it just
shows a lot of processes in the sleep state and a lot of zombies that
are exiting.  The problem goes away, at least for a while, if I kill all
the MTA-related processes and then restart them.  I don't see any
relevant messages being produced by syslog.  The server isn't getting
hammered by spammers or anything like that when it becomes unresponsive.

I've tried using both the kernel's qlogicfc driver and QLogic's qla2200
driver, but neither has any clear advantage over the other.  The problem
can apparently be triggered by heavy usage of either the SCSI disk or
the array anyway, so my suspicion is that the bottleneck is
occurring at a higher level in the kernel.

I have High Memory Support, HIGHMEM I/O Support, SMP, and ACPI enabled,
and I experience the same behavior when HIGHMEM I/O and ACPI are turned
off.  I am using the aic7xxx driver.  I can put the .config file online
as well, if that would help.

Anyone have any ideas?  Thanks,

