Re: SMP on Debian server with Hyperthreading
> On Fri, 2003-09-05 at 17:06, Jason Lim wrote:
>> Hi all,
>> Just wondering... I've got a 2.4Ghz Hyperthreading (100% it is
>> the hyperthreading model), and the BIOS sees it.
>> Hope you can advise... as hyperthreading is there but not
>> being used, which is a waste and could add performance.
Appended below is a mail from a friend of mine; his company does work
with Open Source systems for the larger enterprises in India. He came
across HT, and not finding any "real life" data, decided that the
only task worth doing was a kernel compile. The report is good.
I am cc:ing him; he is not subscribed to this list.
Hope it helps.
Intel has released this new feature in its high-end processors, where
one processor can internally act as two processors in hardware. In
essence, there are two sets of CPU registers, two caches, two TLBs.
But there's only one external address and data bus, so the interface
between the CPU and the rest of the hardware is (largely) like that of
a single processor. Look for Hyperthreading on www.intel.com.
The interesting thing about hyperthreading is that it is done purely
in hardware. This means that the OS kernel does not know the difference
between two physical processors and an Intel Xeon with hyperthreading.
We recently got a client's machine for setting up a database server.
We had asked for a two-processor machine; we got a dual-processor with
Hyperthreading. You go to the ROM BIOS and switch on or switch off
hyperthreading. If it is switched on, /proc/cpuinfo (we work in Linux)
shows four processors.
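
A quick way to confirm what the kernel sees can be sketched as below.
This is only an illustration: the helper name count_cpus is made up
here, and it assumes the usual Linux /proc/cpuinfo layout in which
every logical CPU has one line beginning with "processor".

```shell
# count_cpus [FILE]: count the logical processors listed in a
# cpuinfo-style file. FILE defaults to /proc/cpuinfo.
# (Hypothetical helper name; not part of the original report.)
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}
```

On the machine described here, one would expect count_cpus to print 4
with Hyperthreading switched on in the BIOS and 2 with it switched off.
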
I wanted to see whether we get the power of four processors when we
switch on hyperthreading, in a typical SMP Linux environment. I wanted
to run a set of parallel Unix processes with and without
Hyperthreading, and see whether I got faster system throughput with
four virtual
processors than with two real ones.
TYPE OF JOBS: I wanted jobs which would do some I/O but would do
primarily a lot of in-memory data manipulation. And I didn't have the
time to write custom code. So I chose C compilation. A "make" on a
source tree would give me a lot of this sort of workload.
PARALLELISM CONTROL: I used "make" with the "-j" option of GNU Make.
This controls how many parallel branches are fired by the top-level
make for the compilations. It is clear that this is a less than perfect
way to generate parallel workloads, because a full compilation of a
source tree would not have a consistent degree of parallelism. I am
certain that the last part of a compilation job would be sequential,
but hopefully, with a large enough source tree with sufficiently large
numbers of independent modules, 95%+ of the compilation would have
room for dozens of parallel threads.
ACTUAL WORKLOAD: The final script that I ran would do the following
actions, one after the other in /usr/src/linux:
make -j $COUNT clean
make -j $COUNT dep
make -j $COUNT bzImage
make -j $COUNT modules
As you can see, this already shows you sequential points, when one
"make" completes and the next "make" starts.
The size of the job was quite huge. I ran "make config" first, and hit
"Enter" and kept the key pressed. The resultant configuration has lots
and lots of optional modules selected. For instance, the test compilations
generate 2800+ .o files. The kernel which runs on my laptop generates
just 650+ .o files when compiled.
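
The four-leg workload listed above could be wrapped up roughly as
follows. This is a sketch, not the original script: the function name
run_workload and the MAKE override (handy for a dry run) are
assumptions added for illustration.

```shell
#!/bin/bash
# run_workload COUNT [SRCDIR]: run the four legs one after the other,
# each with "make -j COUNT", stopping at the first leg that fails.
# SRCDIR defaults to /usr/src/linux, as in the description above.
# The body runs in a subshell so the caller's working directory is
# left alone. MAKE may be overridden; that hook is illustrative only.
run_workload() (
    count=$1
    srcdir=${2:-/usr/src/linux}
    cd "$srcdir" || exit 1
    for target in clean dep bzImage modules; do
        ${MAKE:-make} -j "$count" "$target" || exit 1
    done
)
```

Because each leg must finish before the next begins, the sequential
points between legs are visible directly in the loop.
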
I could see that the workload was CPU-intensive; with high parallelism,
I was getting CPU idle time less than 1%. And user-state CPU usage was
93%+, the rest being in system calls. This profile is expected based
on the fact that there is large RAM availability for disk caching.
MEASUREMENT METHOD: I ran the set of "make" commands and used the Bash
$SECONDS variable to get the system clock, +/- 1 second. This error
is okay; my kernel compilation took 2500+ seconds with zero parallelism.
Moreover, with each parallelism setting, I ran the full set of "make"
commands five times, taking the clocktime measurements each time. I
averaged them using integer division. The error of +/- 5 seconds due
to integer division again should not matter; typical job runtimes were
always more than 1000 seconds.
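
The measurement method described above can be sketched like this. The
function names elapsed_seconds and average_int are made up for
illustration, and discarding the command's output is a choice made
here, not something the report specifies.

```shell
#!/bin/bash
# elapsed_seconds CMD [ARGS...]: run a command and print how many
# wall-clock seconds it took, using Bash's $SECONDS counter
# (1-second resolution). The command's own output is discarded.
elapsed_seconds() {
    local start=$SECONDS
    "$@" >/dev/null 2>&1
    echo $((SECONDS - start))
}

# average_int T1 [T2 ...]: mean of the given run times using integer
# division, so any fractional part is dropped.
average_int() {
    [ $# -gt 0 ] || return 1
    local sum=0 t
    for t in "$@"; do
        sum=$((sum + t))
    done
    echo $((sum / $#))
}

# Usage, with made-up numbers: average_int 10 11 13 prints 11.
```

Five timings collected with elapsed_seconds for one parallelism
setting would then be passed to average_int to get the reported figure.
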
SYSTEM CONFIG: Two physical Intel Xeon processors at 1.8GHz (as per
/proc/cpuinfo), 1 GB RAM, IDE drives, ext3 file system. (I also tried
using ext2 filesystems, but got timings identical to ext3; the ext3
journalling does not seem to be adding any significant load.) OS
was Linux 2.4.19-64GB-SMP, a stock SuSE 8.1 SMP kernel.
At peak loads, even with max parallelism, I never saw any swap space
used. This means that the only disk I/O must have been for writing out
intermediate and output files to disk. I guess there was practically
no page fault occurrence on the system during my test runs, though I
didn't bother to verify this. Max RAM usage was about 970MB with -j 6
at peak.
The disks, though IDE, are fast. A "cat /proc/ide/piix" (that "piix"
is the system's IDE chipset) said that the hard disks were working on
UDMA 5 (133MHz IDE speed).
RESULTS: Here are the figures. The first set is with two processors.
The timings are averages of five full runs, as described above.
make with -j 1: 2501 seconds.
make with -j 2: 1218 seconds.
make with -j 3: 1198 seconds.
make with -j 4: 1196 seconds.
make with -j 5: 1234 seconds.
make with -j 6: 1215 seconds.
The second set is with Hyperthreading on, i.e. with four processors as
far as Linux was concerned. I didn't do the "-j 1" run here, I didn't
see any point in repeating the zero-parallelism run.
make with -j 2: 1405 seconds.
make with -j 3: 1153 seconds.
make with -j 4: 1063 seconds.
make with -j 5: 1062 seconds.
make with -j 6: 1079 seconds.
So, as you can see, with just two parallel threads, Hyperthreading
actually degrades overall system throughput. With higher numbers of
parallel threads, Hyperthreading gives me throughput better than the
best figures without.
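
To put numbers on that comparison, the percentage change between the
two result sets can be worked out with shell integer arithmetic. The
helper name pct_change is made up here; the inputs are the figures
reported above.

```shell
# pct_change BASE NEW: percentage change in run time from BASE to NEW,
# truncated toward zero. Negative means NEW is faster.
# (Hypothetical helper name, for illustration only.)
pct_change() {
    echo $(( ($2 - $1) * 100 / $1 ))
}

# Seconds without Hyperthreading, then with Hyperthreading.
pct_change 1218 1405   # -j 2
pct_change 1196 1063   # -j 4
pct_change 1196 1062   # best time without HT vs best time with HT
```

This prints 15, -11 and -11: about 15% slower with Hyperthreading at
-j 2, and about 11% faster at -j 4 and when comparing the best figure
of each set.
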
Therefore, if you have a large number of parallel CPU-intensive jobs,
many more than the number of virtual CPUs, I guess you're better off
with Hyperthreading enabled. But with a limited amount of parallelism,
I guess you _MAY_ in some cases get better throughput without
Hyperthreading.
With four virtual processors, I had already hit peak performance with
-j 5. At -j 6, the times had begun to climb again. This means that the
ideal system throughput comes when there is not too much more
parallelism than the number of processors. This is consistent with the
But if you feel that four virtual processors can give you the
speedups that four physical processors can, then you can forget it.
Don't even _think_ about it.
AN ASIDE: It's amazing to see such fast processors taking 40 minutes+
for a kernel compile (single threaded). These timings were so unbelievable
that I began to doubt the entire exercise. I checked whether the disk
partitions were being mounted with mount option "sync" (synchronous
writethrough), which can _really_ slow down writes. But that was not
the case. I switched from ext3 to ext2, and got no change; journalling
of ext3 does not seem to be slowing down anything. Finally, with -j 6,
I checked timings of each leg separately. "make bzImage" took 171
seconds and created 638 .o files, "make modules" took 991 seconds and
created 2238 .o files. So, I have now concluded that the extraordinarily
long time taken by my kernel compiles is because of the large set of
modules in the default kernel config. The benchmark figures are real.