
out of memory



System: two dual-opteron, amd64 etch, 16 GB RAM, RAID1.

A heavy computation (memory set to either 1750 MB or 3750 MB
per node) started with high CPU usage and little memory
usage. Then the two inverted, and the hard-disk LED stayed
lit without interruption. The computation then terminated
"incomplete", with the following warning message in the
output file:

******************* ARMCI INFO ************************
The application attempted to allocate a shared memory
segment of 38731776 bytes in size. This might be in
addition to segments that were allocated succesfully
previously. The current system configuration does not
allow enough shared memory to be allocated to the
application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory
segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if
SHMMAX matches the amount of memory needed by your
application and the system has sufficient amount of
swap space. Most UNIX systems can be easily
reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by
insufficient swap space.
*******************************************************
0:allocate: failed to create shared region : -1
0:allocate: failed to create shared region : -1
Last System Error Message from Task 0:: Invalid argument
  0: ARMCI aborting -1 (0xffffffffffffffff).
  0: ARMCI aborting -1 (0xffffffffffffffff).
system error message: Invalid argument
3:SigIntHandler: interrupt signal was caught: 2
3:SigIntHandler: interrupt signal was caught: 2
Last System Error Message from Task 3:: No such file or directory
  3: ARMCI aborting 2 (0x2).
  3: ARMCI aborting 2 (0x2).
system error message: No such file or directory
1:SigIntHandler: interrupt signal was caught: 2
1:SigIntHandler: interrupt signal was caught: 2
Last System Error Message from Task 1:: No such file or directory
  1: ARMCI aborting 2 (0x2).
  1: ARMCI aborting 2 (0x2).
system error message: No such file or directory
2:SigIntHandler: interrupt signal was caught: 2
2:SigIntHandler: interrupt signal was caught: 2
Last System Error Message from Task 2:: No such file or directory
  2: ARMCI aborting 2 (0x2).
  2: ARMCI aborting 2 (0x2).
system error message: No such file or directory
 Creating: host=deb64, user=francesco, file=/home/francesco/nwchem50/bin/nwchem, port=57429
  4: interrupt(1)
WaitAll: No children or error in wait?

The URL suggested in the message above is of no help, as it
refers to Linux kernel 2.
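
For reference, the first cause named in the ARMCI message (SHMMAX
too small) can be checked directly. The following is only a minimal
sketch, assuming a Linux kernel that exposes the limit in
/proc/sys/kernel/shmmax. The helper names read_shmmax and
segment_fits are made up for illustration, and the only number taken
from the log is the 38731776-byte segment size.

```python
from pathlib import Path

# Segment size quoted in the ARMCI message in the output file.
REQUESTED_BYTES = 38731776


def read_shmmax(path="/proc/sys/kernel/shmmax"):
    """Return the SHMMAX limit in bytes, read from a procfs-style file.

    The default path is an assumption about the running system.
    """
    return int(Path(path).read_text().strip())


def segment_fits(requested, shmmax):
    """True if a single shared memory segment of `requested` bytes is allowed."""
    return requested <= shmmax


if __name__ == "__main__":
    try:
        limit = read_shmmax()
    except (OSError, ValueError) as exc:
        print(f"could not read SHMMAX: {exc}")
    else:
        verdict = "fits" if segment_fits(REQUESTED_BYTES, limit) else "does NOT fit"
        print(f"SHMMAX = {limit} bytes; a {REQUESTED_BYTES}-byte segment {verdict}")
```

If the segment does not fit, SHMMAX is the likely culprit. If it
does fit, the limit on a single segment is not the problem and the
second cause in the message (swap space) is worth checking next.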

This warning message was exactly the same with the two
different memory allocations. With smaller matrices (i.e.
smaller molecules) the same type of computation completes
normally.
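
The second cause named in the ARMCI message (insufficient swap
space) can be checked in a similar way. This is a rough sketch,
assuming a Linux system where /proc/meminfo reports SwapTotal and
SwapFree in kB. The function names parse_meminfo and swap_summary
are hypothetical and not part of NWChem or ARMCI.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_summary(info):
    """Return total, free and used swap in MB (input values assumed to be kB)."""
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    return {
        "total_mb": total // 1024,
        "free_mb": free // 1024,
        "used_mb": (total - free) // 1024,
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            print(swap_summary(parse_meminfo(fh.read())))
    except OSError as exc:
        print(f"could not read /proc/meminfo: {exc}")
```

Running this while the job is in the phase where the disk is
constantly busy would show whether swap is being used up.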

Thanks for any suggestions on how to tune the system.

francesco pietra


 


