
Re: need help making shell script use two CPUs/cores



Bob Proulx put forth on 1/11/2011 3:08 PM:
> Stan Hoeppner wrote:
>> Camaleón put forth:
>>> real	1m44.038s
>>> user	2m5.420s
>>> sys	1m17.561s
>>>
>>> It uses 2 "convert" processes so the files are being run in pairs.
>>>
>>> And you can even get the job done faster if using -P8:
>>>
>>> real	1m25.255s
>>> user	2m1.792s
>>> sys	0m43.563s
>>
>> That's an unexpected result.  I would think running #cores*2^x with an
>> increasing x value would start yielding slower total run times within a
>> few multiples of #cores.
> 
> If you have enough memory (which is critical) then increasing the
> number of processes above the number of compute units *a little bit*
> is okay and increases overall throughput.
> 
> You are processing image data.  That is a large amount of disk data
> and won't ever be completely cached.  At some point the process will

Not really.  Each file, in my case, started as a 1.8MB JPEG.  The disk
throughput on my server is ~80MB/s, and read latency is about 15-20ms on
average.  My recent example workload was 35 such images, so only ~63MB of
source data in total.  That reads sequentially in under a second and fits
entirely in the page cache after the first pass.
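
For what it's worth, a quick way to confirm a data set that size is cached
is to read it twice (a rough check; assumes the images sit in the current
directory):

    # The first read pulls from disk; the second should come from the
    # page cache and finish almost instantly.
    time cat *.jpg > /dev/null
    time cat *.jpg > /dev/null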

> block on I/O waiting for the disk.  Perhaps not often but enough.  At
> that moment the cpu will be idle until the disk block becomes
> available.  When you are running four processes on your two-CPU machine
> that means there will always be another process in the run queue ready
> to go while waiting for the disk.  That allows processing to continue
> when otherwise it would be waiting for the disk.  I believe what you
> are seeing above is the result of being able to compute during that
> small block on I/O wait for the disk interval.

That's gotta be a very small iowait interval.  So small, in fact, it doesn't
show up in top at all.  I've watched top a few times during these runs and I
never see iowait.
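
For anyone who wants a closer look than top gives, vmstat will catch even
brief stalls (assuming procps vmstat; "wa" is the percentage of time spent
waiting on I/O):

    # Sample once per second while the batch runs and watch the "wa" column.
    vmstat 1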

I assumed the gain was simply because, watching top, each convert process
doesn't actually peg the CPU for its entire lifetime.  Running one or two
more processes in parallel with the first two gives the kernel scheduler the
opportunity to run another process during those idle ticks.  There is also
the time gap between one process exiting and xargs starting the next.  I
have no idea how long that gap is, but all the little bits add up across
the total execution time of all 35 processes.
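
For reference, this is the shape of the invocation we've been discussing
(the filenames and resize geometry are made up; tune -P to your core count):

    # Feed JPEGs to convert(1), two at a time.  -print0/-0 keep
    # whitespace in filenames from splitting arguments.
    find . -maxdepth 1 -name '*.jpg' -print0 |
        xargs -0 -P2 -I{} convert {} -resize 800x600 {}.small.jpg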

> On the negative side having more processes in the run queue does
> consume a little more overhead for process scheduling.  And launching
> a lot of processes consumes resources.  So it definitely doesn't make
> sense to launch one process per image.  But being above the number of
> cpus does help a small amount.

Totally agree.  The decrease in run time is small enough on my system that I
don't bother with 3 processes.  I only parallelize 2, as the extra ~80MB of
memory consumed by a 3rd is better spent on smtpd, imapd, and httpd than on
saving me 5-10 seconds of execution time on the batch photo resize.  This is
a server, after all. ;)
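
If anyone wants to check the memory cost on their own box, something like
this works while the batch is running (RSS overstates the true cost a bit,
since shared library pages are counted per process):

    # Resident set size, in KB, of each running convert process.
    ps -C convert -o rss=,args=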

> Another negative is that other tasks then suffer.  With excess compute
> capacity you always have some cpu time for the desktop side of life.
> Moving windows, rendering web pages, other user tasks, delivering
> email.  Sometimes squeezing that last percentage point out of
> something can really kill your interactive experience and end up
> frustrating you more.  So as a hint I wouldn't push too hard on it.

In my case those other tasks aren't interactive, but they exist nonetheless, as
mentioned above.

> My benchmarks show that hyperthreading (fake cpus) actually slow down
> single thread processes such as image conversions.  HT seems like a
> marketing breakthrough to me.  Although having the effective extra
> registers available may benefit a highly threaded application.  I just
> don't have any performance critical highly threaded applications.  I
> am sure they exist somewhere along with unicorns and other good
> sources of sparkles.

This has been my experience as well.  SMT traditionally doesn't work well
when you run more compute-bound processes than a machine has physical cores.
This was discovered relatively quickly after Intel's HT CPUs hit the market.
Folks began running one SETI@Home process per virtual CPU on dual-socket
Xeon boxen, 4 processes total, and the elapsed time per process increased
substantially versus running one process per socket.
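
If you want to keep a compute-bound pair off HT siblings, Linux exposes the
topology and taskset can pin accordingly (the CPU numbers below are
examples; sibling enumeration varies by machine):

    # Show which logical CPUs share cpu0's physical core, e.g. "0,2".
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

    # Pin two converts to distinct physical cores (here 0 and 1,
    # assuming those are not siblings on this box).
    taskset -c 0 convert a.jpg -resize 800x600 a.small.jpg &
    taskset -c 1 convert b.jpg -resize 800x600 b.small.jpg &
    wait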

-- 
Stan

