
Potential bug in openmpi package?



Hi there,
I'm trying to use Open MPI on Debian stretch, but I can't run programs on slave nodes: mpirun fails with the error attached at the end of this mail. When I switch to Ubuntu, the same setup works fine.
My environment:
Debian stretch
openmpi-bin 2.0.2
Tested in KVM virtual machines and in Docker.

I'm pretty sure the SSH connection itself is good. In fact, no connection attempt shows up in the slave's SSH log at all. I think this is a bug in orted, because when I run orted directly it prints some error messages. According to those messages, my_hnp_uri is NULL, which then causes a fatal error.

Does anyone have any idea what is going on?


openmpi error:

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
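
The first cause listed above (binaries or libraries not found through PATH and LD_LIBRARY_PATH) can be checked on each node with a small script. The following is only a minimal sketch: the function name check_orte_environment and the report layout are made up for illustration and are not part of Open MPI. It uses only the Python standard library and does not start any MPI process.

```python
import os
import shutil


def check_orte_environment(binaries=("mpirun", "orted"), env=None):
    """Report where each binary resolves on PATH and what LD_LIBRARY_PATH holds.

    `env` defaults to the current process environment; passing a dict
    makes it possible to inspect a different (hypothetical) environment.
    """
    env = os.environ if env is None else env
    search_path = env.get("PATH", "")
    report = {
        # shutil.which returns None when the binary is not found on search_path
        name: shutil.which(name, path=search_path)
        for name in binaries
    }
    report["LD_LIBRARY_PATH"] = env.get("LD_LIBRARY_PATH", "")
    return report


if __name__ == "__main__":
    for key, value in check_orte_environment().items():
        print(f"{key}: {value if value else '(not set or not found)'}")
```

For the result to be meaningful, the script should be run on the slave node through the same non-interactive SSH login that mpirun would use, because a non-interactive shell may see a different PATH than an interactive one.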




orted error:


[m5101:02005] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 145
[m5101:02005] [[INVALID],INVALID] bind() failed on error No such file or directory (2)
[m5101:02005] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 247
[m5101:02005] [[INVALID],INVALID] ORTE_ERROR_LOG: Fatal in file routed_radix.c at line 476
[m5101:02005] [[INVALID],INVALID] ORTE_ERROR_LOG: Fatal in file base/ess_base_std_orted.c at line 425
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_routed.init_routes failed
  --> Returned value Fatal (-6) instead of ORTE_SUCCESS
--------------------------------------------------------------------------



Best regards,
Haiyu
