
Re: need help making shell script use two CPUs/cores



Bob Proulx put forth on 1/12/2011 1:11 PM:
> Stan Hoeppner wrote:
>> Bob Proulx put forth:
>>> when otherwise it would be waiting for the disk.  I believe what you
>>> are seeing above is the result of being able to compute during that
>>> small block on I/O wait for the disk interval.
>>
>> That's gotta be a very small iowait interval.  So small, in fact, it
>> doesn't show up in top at all.  I've watched top a few times during
>> these runs and I never see iowait.
> 
> I would expect it to be very small.  So small that you won't see it by
> eye when looking at it with top.  Motion pictures run at 24 frames per
> second.  That is quite good enough for your eye to see it as
> continuous motion.  But to a computer 1/24th of a second is a long
> time.  I don't think you will be able to observe this by looking at it
> with top and a one second update interval.

My point wasn't that not seeing it meant it wasn't happening.  I'm sure I'd
have seen something had I run iostat.  But if it's that small--maybe 2 seconds
of total I/O wait against a script run time of over a minute--how does the I/O
wait come into play to any significant degree?
(Apt analogy, btw--good for others who may not have understood otherwise.)
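
If I ever do revisit this, something like the following should show whether any
iowait actually accumulates during the batch (iostat is in the sysstat package;
the script and log file names here are just placeholders):

  # sample CPU %iowait and per-disk utilization once per second
  # for the duration of the convert batch
  iostat -x 1 > iostat.log &
  IOSTAT_PID=$!
  ./convert-batch.sh        # placeholder for the actual xargs/convert script
  kill "$IOSTAT_PID"
  grep -A1 avg-cpu iostat.log | less    # eyeball the %iowait column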

>> I assumed the gain was simply because, watching top, each convert
>> process doesn't actually fully peg the cpu during the entire process
>> run life.  Running one or two more processes in parallel with the
>> first two simply gives the kernel scheduler the opportunity to run
>> another process during those idle ticks.
> 
> Uhm...  But that is pretty much exactly what I said!  :-) "Doesn't
> actually fully peg the cpu" is because eventually it will need to
> block on I/O from the disk.  The process will run until it either
> blocks or is interrupted at the end of its timeslice.  Do you propose
> other reasons for the process not to "fully peg the cpu" than for I/O
> waits?

Yes, I do.  I've not looked at the code so I can't say for sure.  However,
watching top (yes, not that accurate) during the runs showed periods of multiple
seconds where each convert process was running at only 60% CPU, then it would
bump back up to 100%.  IIRC this happened multiple times.  Considering this is
an image processing program, I would _assume_ the entire image file is loaded
into memory at startup and only written back out once processing is complete.
I don't see why the process would be touching the disk during its run,
especially with these small 1.8 MB jpg files.  Thus I'm guessing there are a
couple of code routines in the conversion process that simply don't peg the
CPU.  Or is it possible that memory contention between the two CPUs causes this
"less than 100%" CPU usage reported in top, and that when each is running at
100% most of the working set actually fits in that tiny 128 KB L2 cache?  I'm
not a top expert.  If a process stalls waiting on memory, does the kernel still
report it as 100% CPU, or lower?

Anyway, these are the two possible reasons I propose for the less than 100% CPU
usage of the convert processes.  I'm making educated guesses here, not stating fact.
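
If I wanted to chase this further, the way to do it would be capturing
per-process numbers faster than top's one second refresh.  A sketch, again with
made-up log file names (pidstat is in the sysstat package):

  # top in batch mode, 5 samples per second for 60 seconds,
  # keeping only the convert lines
  top -b -d 0.2 -n 300 | grep convert > convert-cpu.log

  # or per-process stats for anything named convert, once per second
  pidstat -C convert 1 60 > convert-pidstat.log

(As I understand it, time a process spends stalled on cache misses still counts
as CPU time to the kernel, so that by itself shouldn't show up as less than
100% in top.)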

>> There is also the time gap between a process exiting and xargs
>> starting up the next one.
> 
> But what would be the cause of that gap?  Waiting on disk to load the
> executable?  (Actually it should be cached into filesystem buffer
> cache and not have to wait for the disk.)  AFAIK there isn't any
> gap there.  (Actually as long as there is another convert process in
> memory then the next one will start very quickly by being able to
> reuse the same memory code pages.)

As you said, top's 1 second interval, and the manner in which it displays what
is happening, may be masking what's really going on.  What I described was from
watching the %CPU of each process, not the summary-area %CPU.  Likely what I
called a "gap" was merely one convert PID dying and another starting up at
another location further up the screen.  With each of those events landing in a
different "frame", that would explain the appearance of a time gap.

So, I'd say I was wrong in describing that as a time gap.  I'd have to do some
testing with other tools to verify all of this for certain, and frankly I'd
rather not spend the time on it at this point.  You solved my original problem,
Bob!  Thanks again.  That was the important takeaway here.  Now we're into
minutiae (which can be fun, but I'm spending way too much time on debian-user
email the last few days).

>> I have no idea how much time that takes.  But all the little bits
>> add up in the total execution time of all 35 processes.
> 
> Yes.  All of the little bits add up and I believe accounts for the
> decrease in total wall-clock time from start to finish.  A small but
> measurable value.
> 
> And I think we were in agreement about everything else.  :-)

Yep.  Chalk all this up to incorrect data due to insufficient frame rate. :)

Ahh, something else I just realized.  Feel free to slap me if you like. :)

Given that this is a production mail (MX) and web server, it's very likely that
daemons awoke and ate some CPU without causing a highlight change in top.
Since I was intently watching the convert processes, I may not have noticed
them, or simply ignored them.  That's a better explanation for the less than
100% CPU per convert process than anything else, and far more likely.  smtpd,
imapd, lighttpd, etc. are frequently firing and eating little bits of CPU.
This is a personal server, so the traffic is small, but the daemons nonetheless
fire regularly.  Postfix alone fires 3 or 4 daemons when mail arrives.  None of
them eats much CPU time, but it all adds up.  And context switching on a 550
MHz CPU with only 128 KB of L2 cache is going to be expensive when two
compute-intensive tasks are running.  I think I can state with confidence at
this point that this is the reason for the "gaps" I saw.
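
Next time I could confirm that instead of guessing.  Something along these
lines run alongside the batch would show both the daemons waking up and the
context switch rate (again, the log names are just placeholders):

  # system-wide: run queue (r), context switches/sec (cs), us/sy/id/wa
  vmstat 1 60 > vmstat.log &

  # per-process: any daemon that wakes up and burns CPU during the
  # run gets its own line here
  pidstat 1 60 > pidstat.log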

Silly me.  This is what happens, kids, when you focus too intently on one thing
and aren't thorough about using the right monitoring tool for the job at
hand. :)

-- 
Stan


