
Re: need help making shell script use two CPUs/cores



Camaleón put forth on 1/12/2011 3:56 AM:
> On Tue, 11 Jan 2011 15:58:45 -0600, Stan Hoeppner wrote:
> 
>> Camaleón put forth on 1/11/2011 9:38 AM:
>>
>>> I supposed you wouldn't care much in getting a script to run faster
>>> with all the available core "occupied" if you had a modern (<4 years)
>>> cpu and plenty of speedy ram because the routine you wanted to run it
>>> should not take many time... unless you were going to process
>>> "thousand" of images :-)
>>
>> That's a bit ironic.  You're suggesting the solution is to upgrade to a
>> new system with a faster processor and memory.  
> 
> Why did you get that impression? No, I said I thought you were running a 
> resource-scarce machine so in order to simulate your environment I made 
> the tests under my VM... nothing more.

My bad Camaleón.  I misunderstood what you said.  My apologies.

>> However, all the newer processors have 2, 4, 6, 8, or 12 cores.  So
>> upgrading simply for single process throughput would waste all the
>> other cores, which was the exact situation I found myself in.
> 
> But of course! I would not even think in upgrade the whole computer just 
> to get one concrete task done a few more seconds faster.

This depends on the task, of course.  In my case it just wouldn't make sense,
just as you say.  I've managed some systems that we'd upgrade every two years
because of a single application that never seemed to have enough horsepower
under the hood.  HPC compute centers seem to follow this trend.  For many of
them there are never enough cycles or enough nodes.

>> The ironic part is that parallelizing the script to maximize performance
>> on my system will also do the same for the newer chips, but to an even
>> greater degree on those with 4, 6, 8, or 12 cores.  Due to the fact that
>> convert doesn't eat 100% of a core's time during its run, and the idle
>> time in between one process finishing and xargs starting another, one
>> could probably run 16-18 parallel convert processes on a 12 core Magny
>> Cours with this script before run times stop decreasing.
> 
> I think the script should also work very well with single-core cpus.

This might depend on the hardware, but as I mentioned, it looks like the convert
program doesn't use 100% of a CPU during its run, so yes, using the xargs script
to fire up two concurrent convert processes, with the kernel time slicing
between them, would probably decrease overall run time to some degree.
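To make that concrete, here is a minimal sketch of the kind of xargs pipeline
under discussion.  The file pattern, resize geometry, and output naming are my
own placeholders, not the original script's values:

```shell
#!/bin/sh
# Sketch of the parallel convert pipeline discussed above.  Assumes GNU
# findutils/xargs and ImageMagick's convert; geometry and file names are
# placeholders, not the original script's.
# -P 2 runs two convert processes at once; -print0/-0 keep odd file
# names safe; -I{} substitutes each file name into the command.
find . -maxdepth 1 -name '*.jpg' -print0 |
  xargs -0 -P 2 -I{} convert {} -resize 50% {}.small.jpg
```

On a dual-core box this keeps both cores busy; on a machine with more cores,
bumping the -P value is the only change needed.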

> Yeah, and tests are there to demonstrate the gain.

Which is always a big plus.  No guess work. :)

>> I had run 4 (2 core machine) and run time was a few seconds faster than
>> 2 processes, 3 seconds IIRC.  Running 8 processes pushed the system into
>> swap and run time increased dramatically.  Given that 4 processes only
>> have a few seconds faster than two, yet consumed twice as much memory,
>> the best overall number of processes to run on this system is two.
> 
> Maybe the "best number of processes" is system-dependant (old processors 
> could work better with a conservative value but newer ones can get some 
> extra seconds with a higher one and without experiencing any significant 
> penalty).

I don't have the machines here to confirm that hypothesis, but knowledge and
experience tell me you're exactly correct.  The reasons you're correct are tied
mostly to available L2/L3 cache bandwidth, and to memory size and bandwidth.
On my SUT, one convert process at its peak easily consumes more than half the
memory bandwidth, which is why I see only about a 50% improvement in run time
with 2 processes, one running on each CPU, instead of the ideal 100% (2x)
improvement.  Each 500 MHz Celeron CPU has only 128KB of L2 cache, and system
memory bandwidth on the 440BX chipset is only 800 MB/s.
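Along those lines, one way to make the script portable across such different
machines is to derive the process count from the core count at run time.  This
is only a sketch: the 2x oversubscription factor below is a guess based on
convert not saturating a core, not a measured optimum.

```shell
#!/bin/sh
# Hypothetical heuristic: size the xargs -P value from the machine
# rather than hard-coding it.  Assumes GNU coreutils' nproc.
CORES=$(nproc)
JOBS=$((CORES * 2))   # guess: 2 processes per core, since convert idles some
echo "would run: xargs -P $JOBS"
```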

Depending on the size of the photos one is converting, if they're relatively
small like my 8.3MP 1.8MB jpegs, I'd think something like a dual core Phenom II
X2 w/ 6MB L3 cache and 21.4 GB/s memory b/w would likely continue to scale with
reduced overall script run time up to 4 parallel convert processes, maybe more,
due to the "excess" of L3 cache and the 10.7 GB/s available to each core.

Conversely, I'd think that a quad core Athlon II X4, with no L3 cache, only
512KB of L2 cache per core, and each core effectively receiving only 5.3 GB/s
of b/w, would not scale as effectively to core_count*2 parallel processes as
the Phenom II X2 would.  In fact, with 4 cores and little cache sharing the
same 21.4 GB/s of memory b/w, the quad core Athlon II would probably see
diminishing run-time gains going from 2 processes to 4 as twice as many cores
compete for memory access, tailing off dramatically as the process count is
increased to 5 and up.

Just a guess.  Anyone have such systems to test with? :)
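For anyone who does have such systems, a simple way to find the knee in the
curve is to time the same pipeline at increasing process counts.  A sketch,
with file names and geometry again as placeholders:

```shell
#!/bin/sh
# Scaling test sketch: time the conversion at several -P values and watch
# where run time stops dropping.  Assumes GNU find/xargs and ImageMagick
# convert; run it in a directory of sample jpegs.
for P in 1 2 4 8 16; do
  echo "== $P parallel convert processes =="
  time ( find . -maxdepth 1 -name '*.jpg' -print0 |
         xargs -0 -P "$P" -I{} convert {} -resize 50% {}.p$P.jpg )
done
```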

-- 
Stan

