
Re: Autopkgtest and MPI code



On 23/05/2015 14:10, Anton Gladky wrote:
> Hi Thibaut,
> 
> for testing MPI I add the following line into the script:
> 
> export OMPI_MCA_orte_rsh_agent=/bin/false
> 
> Usually it works [1--3].

Dear Anton,

Thanks for the tip. Unfortunately it does not work here (tested only on
jessie, and only with gyoto): setting this variable does not change
anything.
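
For completeness, this is roughly how I set it, before anything touches
MPI (a hypothetical Python sketch, not the actual test script; the job
name is a placeholder):

  import os, subprocess, sys
  # As suggested: point Open MPI's rsh launch agent at /bin/false so it
  # cannot try to reach other nodes.
  os.environ["OMPI_MCA_orte_rsh_agent"] = "/bin/false"
  # "gyoto-mpi-job" stands in for the actual test command.
  sys.exit(subprocess.call(["orterun", "-np", "1", "gyoto-mpi-job"]))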

Gyoto is a bit peculiar compared to most MPI codes in that it uses
MPI_Comm_spawn to spawn workers instead of relying on mpirun to launch
several identical processes. This scenario may run into issues that the
more classical strategy does not.
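
For reference, the pattern looks roughly like this (a minimal, purely
illustrative mpi4py sketch, not Gyoto's actual code):

  from mpi4py import MPI
  import sys

  parent = MPI.Comm.Get_parent()
  if parent == MPI.COMM_NULL:
      # Manager side: a single process that spawns its own workers
      # instead of being one of several identical mpirun-launched ranks.
      workers = MPI.COMM_SELF.Spawn(sys.executable, args=[sys.argv[0]],
                                    maxprocs=4)
      workers.bcast("work item", root=MPI.ROOT)
      workers.Disconnect()
  else:
      # Worker side: receives work through the parent inter-communicator.
      task = parent.bcast(None, root=0)
      parent.Disconnect()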

Oddly, it seems that the shared memory transport does not work at all:
if I use "orterun --mca btl sm,self <job>", the code always crashes the
machine.

What does work on my box is:
orterun --mca btl_tcp_if_include lo <job>

This never crashes the machine, but it does not work in a chroot (for
lack of a loopback interface, I guess). I get this error message:

adt-run [19:01:17]: test python-gyoto-mpi: [-----------------------
[tantive-iv:26356] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
ess_hnp_module.c at line 170
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[tantive-iv:26356] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[tantive-iv:26356] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 694
adt-run [19:01:18]: test python-gyoto-mpi: -----------------------]
adt-run [19:01:18]: test python-gyoto-mpi:  - - - - - - - - - - results
- - - - - - - - - -
python-gyoto-mpi     FAIL non-zero exit status 243
thibaut@tantive-iv:~/git/gyoto$

Kind regards, Thibaut.


> 
> 
> [1] http://anonscm.debian.org/cgit/debian-science/packages/esys-particle.git/tree/debian/tests/build1#n7
> [2] https://anonscm.debian.org/cgit/debian-science/packages/liggghts.git/tree/debian/tests/heat
> [3] https://anonscm.debian.org/cgit/debian-science/packages/liggghts.git/tree/debian/tests/packing
> 
> Best regards
> 
> Anton
> 
> 
> 2015-05-23 10:41 GMT+02:00 Thibaut Paumard <thibaut@debian.org>:
>> Hi,
>>
>> I'm working on autopkgtest support in one of my packages, gyoto.
>>
>> The upcoming upstream release (preview available on our alioth git repo)
>> features MPI parallelisation, and I want to test this feature.
>>
>> In my experience, running MPI code requires network access. Failing
>> that, openmpi hangs the machine. For instance, when debugging in the
>> subway, I have to connect to my cell phone over wifi, or else the
>> computer will freeze during the test suite!
>>
>> I'm wondering whether putting "Restrictions: isolation-container" in the
>> test stanza is sufficient to ensure openmpi will behave properly?
>>
>> Relatedly, is there a way to test the code during build, since it is
>> forbidden to access the network at that time?
>>
>> Kind regards, Thibaut.
>>
>>
>> --
>> To UNSUBSCRIBE, email to debian-science-request@lists.debian.org
>> with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
>> Archive: https://lists.debian.org/55603D25.1080205@debian.org
>>
> 
> 

