Re: Similar systems, different performance of fortran code



On Fri, 2005-04-22 at 21:22 +0900, Victor Munoz wrote:
> More information about this.
> 
> Now that the problem seems to lie in g77-3.3, I have done line-by-line
> time profiling, comparing the output of g77-2.95 and g77-3.3 on the
> sid ("faster") machine.
> 
> I'm compiling with the -g option only, and profiling with 'qprof -g line'.
> 
> The 2.95-version profile, after sorting with '-n -k2 -r', shows this at the
> top:
> 
> libm.so.6(sqrt)                                                  163    ( 11%)
> libc.so.6(__write)                                               95     (  6%)
> move_:em1ori.f:1429                                              57     (  4%)
> move_:em1ori.f:1450                                              53     (  3%)
> move_:em1ori.f:1443                                              53     (  3%)
> move_:em1ori.f:1413                                              53     (  3%)
> 
> This is the 'normal' behavior.
> 
> The 3.3-version profile, by contrast, shows this:
> 
> move_:em1ori.f:1417                                              1005   (  9%)
> move_:em1ori.f:1416                                              936    (  8%)
> move_:em1ori.f:1415                                              930    (  8%)
> move_:em1ori.f:1439                                              833    (  7%)
> move_:em1ori.f:1443                                              812    (  7%)
> move_:em1ori.f:1433                                              767    (  7%)
> move_:em1ori.f:1414                                              728    (  7%)
> 
> etc. In total, 18 lines have counts above 100 (above 200, in fact), all
> of them part of the 'move' subroutine.
> 
> However, the lines in question are not strange. The worst line, l.1417, is:
> 
> 1417:            abzpt=abz(ij+1)+del*(abz(ij+2)-abz(ij+1))+ab0z
> 
> All "heavy" lines are of the same kind:
> 
> 1417:            abzpt=abz(ij+1)+del*(abz(ij+2)-abz(ij+1))+ab0z
> 1416:            abypt=aby(ij+1)+del*(aby(ij+2)-aby(ij+1))
> 1415:            aezpt=aez(ij+1)+del*(aez(ij+2)-aez(ij+1))
> 1439:            f=2.d0/(1.d0+abxpt*abxpt+abypt*abypt+abzpt*abzpt)
> 1443:            gvxs=gvxs+vvy*abzpt-vvz*abypt+aexpt
> 1433:            vvx=gvxs+gvys*abzpt-gvzs*abypt
> 1414:            aeypt=aey(ij+1)+del*(aey(ij+2)-aey(ij+1))
> 
> etc.
> 
> These lines cover almost completely the 'move' subroutine, at least the part
> inside the loop which moves the particles. I would be happy if it were that
> simple, but it's not, since the last part of the loop, which contains lines
> such as:
> 
> 1459:            jym(ij+1)=jym(ij+1)+dells*qdtdn*vy(j)
> [...]
> 1470:            delrs=x(j)-ij
> [...]
> 1473:            rho(ij+2)=rho(ij+2)+delrs*qdxdn
> 
> etc., does not get counts as high as the rest:
> 
> move_:em1ori.f:1459                                              5      (  0%)
> move_:em1ori.f:1470                                              25     (  0%)
> move_:em1ori.f:1473                                              11     (  0%)
> 
> However, these lines of code are executed as many times as the previous ones.

As I noted before, memory may be playing a role here. If the various
arrays do not fit in your L1 cache, or your registers are being re-used
heavily, that will slow things down irrespective of the chip's flop
rating. I believe you said the compilers had no optimization turned on,
but have you checked that they're both producing code for the right
hardware? And have you checked the changelogs, bug reports, etc. to see
what has changed between the compiler versions?
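
One way to test the memory theory would be to time the interpolation
statement on its own, outside the full code. The following is only a
sketch: the program name, array size (n), repeat count (nrep) and fill
values are my assumptions, not taken from em1ori.f, and only the
statement in the inner loop is copied from line 1417 of your profile.
Build it with both g77-2.95 and g77-3.3 using the same flags and compare
the timings; shrinking n until the array fits in L1 should show whether
cache size matters.

```fortran
      program interp
c     Hypothetical micro-benchmark: times only the interpolation
c     statement from line 1417 of the quoted profile. The array
c     size, repeat count and fill values are assumptions.
      implicit none
      integer n, nrep
      parameter (n = 100000, nrep = 200)
      double precision abz(n+2), del, ab0z, abzpt, total
      real t0, t1
      integer ij, k

c     Fill the field array with arbitrary but reproducible values.
      do 10 ij = 1, n+2
         abz(ij) = dble(ij) * 1.0d-3
   10 continue
      del = 0.25d0
      ab0z = 1.0d0
      total = 0.0d0

      call cpu_time(t0)
      do 30 k = 1, nrep
         do 20 ij = 0, n
            abzpt = abz(ij+1) + del*(abz(ij+2) - abz(ij+1)) + ab0z
            total = total + abzpt
   20    continue
   30 continue
      call cpu_time(t1)

c     Printing the sum keeps the result of the loop in use.
      print *, 'checksum =', total
      print *, 'cpu seconds =', t1 - t0
      end
```

If the two compilers give similar times here but not in the full code,
that would suggest the slowdown comes from how 3.3 handles the whole
'move' loop rather than from this statement in isolation.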


