I am using a cluster of machines running Debian 5.0.4, kernel 2.6.26-2-amd64. These machines have dual Intel Xeon E5530 2.4GHz CPUs, which are quad-core CPUs with hyperthreading. That gives each machine 8 physical cores (which I will call physical CPUs below) and a total of 16 logical CPUs.
I have run into an apparent issue with the kernel scheduler. Under the circumstances described below, the scheduler will run two tasks on two logical CPUs of the same physical CPU, even if all the remaining physical CPUs are idle. This obviously causes a large slowdown for these tasks.
What I’m doing is this. I have a simple process that reads a file from disk and performs some computation. The process is largely CPU bound, so if execution of one such task takes N seconds, I would expect execution of two parallel tasks to also take N seconds in the absence of other tasks on the system. However, if these two tasks are the only thing running on the system, the scheduler will consistently assign one task to CPU 0 and the other to CPU 8. Since these are logical CPUs on the same physical CPU, the actual run time of the two parallel tasks is closer to 1.8N, much slower than what is possible.
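For anyone who wants to confirm which logical CPUs are siblings, here is a minimal Python sketch. It assumes the kernel exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` (I have not verified that this file exists on every 2.6.26 build), and the function names are my own, purely for illustration.

```python
import glob

SIBLINGS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,8" or "0-3,8-11" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_sibling_groups(pattern=SIBLINGS_GLOB):
    """Return the distinct groups of logical CPUs that share one physical core."""
    groups = set()
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                groups.add(tuple(parse_cpu_list(handle.read())))
        except OSError:
            # The topology file may be missing on some kernels; skip it.
            continue
    return sorted(groups)


def share_core(cpu_a, cpu_b, groups):
    """True if both logical CPUs appear in the same sibling group."""
    return any(cpu_a in group and cpu_b in group for group in groups)


if __name__ == "__main__":
    for group in read_sibling_groups():
        print("siblings:", group)
```

On the machines described here, I would expect a group containing CPUs 0 and 8, so `share_core(0, 8, read_sibling_groups())` should be true if the topology files are present.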
The problem seems to arise from I/O interrupt handling. If I look at /proc/interrupts, it seems that all interrupts are handled by the first physical CPU. These are then apparently processed by one of this CPU's logical CPUs (which correspond to CPU 0 and 8). Once the tasks have run on these CPUs, natural affinity ensures that the kernel scheduler will keep them there. This leads to an interesting observation. If I create two tasks that do no I/O (for example because all their I/O requests can be satisfied from the cache), they are scheduled on two random CPUs and run fast. But if there is even a single I/O operation causing an interrupt anywhere in the process, from that point on the tasks stay on CPU 0 and 8, even if they do no further I/O, and run much slower.
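To make the /proc/interrupts observation easier to check, here is a rough Python sketch that sums the interrupt counts per logical CPU. It assumes the usual layout (a header row of CPU names, then one row per interrupt source with one count per CPU), and the function name is hypothetical.

```python
def per_cpu_interrupt_totals(text):
    """Sum interrupt counts per CPU column from /proc/interrupts-style text."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpu_names = lines[0].split()          # e.g. ["CPU0", "CPU1", ...]
    totals = {name: 0 for name in cpu_names}
    for line in lines[1:]:
        if ":" not in line:
            continue
        _, rest = line.split(":", 1)
        counts = rest.split()[:len(cpu_names)]
        # Rows such as "ERR:" carry a single global count, not one per CPU.
        if len(counts) < len(cpu_names) or not all(c.isdigit() for c in counts):
            continue
        for name, count in zip(cpu_names, counts):
            totals[name] += int(count)
    return totals


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as handle:
            totals = per_cpu_interrupt_totals(handle.read())
    except OSError:
        totals = {}
    for name, total in totals.items():
        print(name, total)
```

If the behaviour described above holds, the totals for the CPUs that never receive device interrupts should stay far lower than the total for CPU0 (timer and other per-CPU interrupts will still show up on every CPU).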
It seems to me that the scheduler should penalize running a task on a logical CPU whose sibling is already busy, while other physical CPUs are idle, more heavily than it penalizes moving the task to a different CPU. That does not appear to be the case.
I can work around this issue by setting CPU affinity for the tasks to CPUs 0-7, effectively disabling hyperthreading. However, this is not an ideal solution.
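For reference, the workaround can be sketched as a small Python wrapper. This is only an illustration, assuming Python 3.3 or later on Linux (for `os.sched_setaffinity`) and assuming that logical CPUs 0-7 are one thread per core, as they appear to be on these machines. The helper name `pin_to_cpus` is my own. The same effect should be achievable with `taskset -c 0-7`.

```python
import os
import sys


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 = the caller) to the given logical CPUs.

    Only CPUs that the process is currently allowed to use are kept.
    Returns the affinity set in effect afterwards.
    """
    allowed = os.sched_getaffinity(pid)
    wanted = set(cpus) & allowed
    if not wanted:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(pid, wanted)
    return os.sched_getaffinity(pid)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: CPUs 0-7 are the first hardware thread of each core.
    pin_to_cpus(range(8))
    # Replace this process with the real task; the affinity is inherited.
    os.execvp(sys.argv[1], sys.argv[1:])
```

Saved under a hypothetical name such as `pin8.py`, it would be run as `python3 pin8.py ./mytask args`, and the task would then only ever be scheduled on CPUs 0-7.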
My question then is twofold. Firstly, why are all interrupts being handled by the first CPU? I checked the various /proc/irq/#/smp_affinity entries and they are all 0000ffff so that’s not the issue. By changing the value in those files to a specific CPU I can get the interrupts to be handled by a different CPU, but that just moves the problem. No matter what I do, I can’t get them to be handled by more than one CPU. I’ve tried running irqbalance but that also didn’t help. Is there a way to prevent this interrupt CPU affinity, and if so would it fix my problem?
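To spell out what the mask means: each smp_affinity value is a hexadecimal bitmask in which bit N allows logical CPU N, so 0000ffff allows CPUs 0 through 15. A small Python sketch for converting in both directions (the function names are mine, for illustration only):

```python
def mask_to_cpus(mask_text):
    """Decode a hex affinity mask such as "0000ffff" into a list of CPU numbers.

    Larger systems print the mask as comma-separated 32-bit groups,
    so commas are removed before parsing.
    """
    value = int(mask_text.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def cpus_to_mask(cpus, width=8):
    """Encode CPU numbers as a zero-padded hex mask of the given width."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    return format(value, "0{}x".format(width))


if __name__ == "__main__":
    print(mask_to_cpus("0000ffff"))
    print(cpus_to_mask(range(1, 8)))
```

A string produced by `cpus_to_mask` is the kind of value I have been writing into the smp_affinity files when moving interrupts to a specific CPU.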
Secondly, why does the scheduler not realize that satisfying natural affinity is not a good idea if the CPUs involved are logical siblings of each other on the same physical CPU? I thought that the Linux kernel was hyperthreading-aware and would take these kinds of things into consideration. Is this a true shortcoming of the scheduler, or is my system misconfigured somehow?
I hope you will be able to help.