Re: Similar systems, different performance of fortran code
On Fri, 2005-04-22 at 21:22 +0900, Victor Munoz wrote:
> More information about this.
>
> Now that I know that the problem seems to be in g77-3.3, I did line-by-line
> time profiling, comparing the output of g77-2.95 and g77-3.3 on the
> sid ("faster") machine.
>
> I'm compiling with the -g option only, and profiling with 'qprof -g line'.
>
> The 2.95-version profile, after sorting with '-n -k2 -r', shows this at the
> top:
>
> libm.so.6(sqrt) 163 ( 11%)
> libc.so.6(__write) 95 ( 6%)
> move_:em1ori.f:1429 57 ( 4%)
> move_:em1ori.f:1450 53 ( 3%)
> move_:em1ori.f:1443 53 ( 3%)
> move_:em1ori.f:1413 53 ( 3%)
>
> This is the 'normal' behavior.
>
> Then, the 3.3-version profile, shows this:
>
> move_:em1ori.f:1417 1005 ( 9%)
> move_:em1ori.f:1416 936 ( 8%)
> move_:em1ori.f:1415 930 ( 8%)
> move_:em1ori.f:1439 833 ( 7%)
> move_:em1ori.f:1443 812 ( 7%)
> move_:em1ori.f:1433 767 ( 7%)
> move_:em1ori.f:1414 728 ( 7%)
>
> etc. In total, 18 lines with numbers above 100 (above 200 in fact), all
> of them part of the 'move' subroutine.
>
> However, the lines in question are not strange. The worst line, l.1417, is:
>
> 1417: abzpt=abz(ij+1)+del*(abz(ij+2)-abz(ij+1))+ab0z
>
> All "heavy" lines are of the same kind:
>
> 1417: abzpt=abz(ij+1)+del*(abz(ij+2)-abz(ij+1))+ab0z
> 1416: abypt=aby(ij+1)+del*(aby(ij+2)-aby(ij+1))
> 1415: aezpt=aez(ij+1)+del*(aez(ij+2)-aez(ij+1))
> 1439: f=2.d0/(1.d0+abxpt*abxpt+abypt*abypt+abzpt*abzpt)
> 1443: gvxs=gvxs+vvy*abzpt-vvz*abypt+aexpt
> 1433: vvx=gvxs+gvys*abzpt-gvzs*abypt
> 1414: aeypt=aey(ij+1)+del*(aey(ij+2)-aey(ij+1))
>
> etc.
>
> These lines cover almost completely the 'move' subroutine, at least the part
> inside the loop which moves the particles. I would be happy if it were that
> simple, but it's not, since the last part of the loop, which contains lines
> such as:
>
> 1459: jym(ij+1)=jym(ij+1)+dells*qdtdn*vy(j)
> [...]
> 1470: delrs=x(j)-ij
> [...]
> 1473: rho(ij+2)=rho(ij+2)+delrs*qdxdn
>
> etc., do not get counts as high as the rest:
>
> move_:em1ori.f:1459 5 ( 0%)
> move_:em1ori.f:1470 25 ( 0%)
> move_:em1ori.f:1473 11 ( 0%)
>
> However, these lines of code are executed as many times as the previous ones.
As I noted before, memory may be playing a role here. If the various arrays
do not fit in your L1 cache, or your registers are being reused heavily,
then that will slow things down irrespective of the chip's flop rating.
I believe you said the compilers had no optimization turned on, but have
you checked that they're both producing code for the right hardware? And
have you checked the changelogs, bug reports, etc. to see what's changed
between the compiler versions?
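
One way to narrow this down might be to pull the interpolation pattern from
line 1417 out into a tiny stand-alone program, build it with both g77-2.95
and g77-3.3 using the same -g option, and time each binary (with the shell's
'time', for instance). The following is only a rough sketch: the program
name, the array size n, the repeat count nrep and all the values are made up
for illustration and are not taken from em1ori.f. Changing n would let you
see whether the array size relative to the cache makes any difference.

```fortran
      program interp
c     Hypothetical stand-alone test of the interpolation pattern
c     from line 1417. The array size, repeat count and values are
c     assumptions for illustration, not taken from em1ori.f.
      implicit none
      integer n, nrep
      parameter (n = 1024, nrep = 20000)
      double precision abz(n+2), del, ab0z, abzpt, total
      integer i, ij, irep
c     Fill the field array with arbitrary smooth values.
      do 10 i = 1, n+2
         abz(i) = 1.0d-3*dble(i)
   10 continue
      del = 0.25d0
      ab0z = 0.5d0
      total = 0.0d0
c     Repeat the sweep so the run is long enough to time.
      do 30 irep = 1, nrep
         do 20 ij = 0, n-1
            abzpt = abz(ij+1) + del*(abz(ij+2) - abz(ij+1)) + ab0z
            total = total + abzpt
   20    continue
   30 continue
c     Print the sum so the loops cannot be discarded.
      write(*,*) 'checksum =', total
      end
```

If the two binaries differ here as much as the full code does, the problem
is likely in the code generated for this kind of expression; if they don't,
something else in 'move' (or in how the real arrays are laid out) is
probably involved.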