On 2019-08-13 03:51, Steffen Möller wrote:
> Hello,
> There are a few data formats in bioinformatics now depending on hdf5 and
> h5py is used a lot. My main concern is that the user should not need to
> configure anything, like a set of hostnames. And there should not be
> anything stalling because it is waiting to contact a server. MPI needs to
> be completely transparent and then I would very much like to see it.

MPI is generally good that way. The program runs directly as a
simple serial program if you run it on its own, so in that sense it
should be transparent to the user (i.e. you won't know it's MPI-enabled
unless you know to look for it). A multi-CPU job is launched by
running the program with mpirun (or mpiexec).
For example, in the context of Python and h5py, if you run

    python3 -c 'import h5py'

then the job runs as a serial job, regardless of whether h5py is built
for hdf5-serial or hdf5-mpi.
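
As a rough sketch of how one could tell the two builds apart from Python:
h5py reports whether it was compiled with MPI support through
h5py.get_config().mpi. The helper name h5py_mpi_status below is made up
for illustration, and it returns None when h5py is not installed at all.

```python
import importlib


def h5py_mpi_status():
    """Return True/False for an MPI-enabled h5py build, or None if h5py is missing.

    The function name is hypothetical; only h5py.get_config().mpi is h5py's own API.
    """
    try:
        h5py = importlib.import_module("h5py")
    except ImportError:
        return None
    # get_config().mpi is True only when h5py was built against parallel HDF5.
    return bool(h5py.get_config().mpi)


if __name__ == "__main__":
    status = h5py_mpi_status()
    if status is None:
        print("h5py is not installed")
    elif status:
        print("h5py was built with MPI support (hdf5-mpi)")
    else:
        print("h5py was built without MPI support (hdf5-serial)")
```

Either way the script itself runs as an ordinary serial program when
started with plain python3.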
If you want to run on 4 CPUs, you launch the same program with

    mpirun -n 4 python3 -c 'import h5py'

Then if h5py is built against hdf5-mpi, it can handle HDF5 access as
one multiprocessor job. If h5py is instead built against hdf5-serial,
the same serial job simply runs 4 times concurrently.
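
To make the two cases concrete, here is a minimal sketch of a script that
behaves sensibly under either build. It assumes mpi4py is available for
the parallel case; the helper file_open_kwargs and the file name
demo_parallel.h5 are hypothetical names chosen for this example. Note that
in this sketch the parallel file access is requested explicitly through the
"mpio" driver, so it is an illustration rather than a statement of how any
particular program uses h5py.

```python
def file_open_kwargs(mpi_build, comm=None):
    """Extra keyword arguments for h5py.File (hypothetical helper).

    The parallel "mpio" driver is requested only when h5py is an MPI build
    and an MPI communicator is available; otherwise plain serial access.
    """
    if mpi_build and comm is not None:
        return {"driver": "mpio", "comm": comm}
    return {}


def main():
    try:
        import h5py
    except ImportError:
        print("h5py is not installed")
        return
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed for the parallel case
        comm = MPI.COMM_WORLD
    except ImportError:
        comm = None

    kwargs = file_open_kwargs(h5py.get_config().mpi, comm)
    if not kwargs:
        # Serial build: under mpirun every process would run this same branch.
        print("serial h5py: this process works independently of the others")
        return

    rank, size = comm.Get_rank(), comm.Get_size()
    # All ranks open the same file collectively; each writes its own element.
    with h5py.File("demo_parallel.h5", "w", **kwargs) as f:
        dset = f.create_dataset("ranks", (size,), dtype="i")
        dset[rank] = rank


if __name__ == "__main__":
    main()
```

Saved as, say, demo.py (a made-up name), this would be started with
"python3 demo.py" for a single process or "mpirun -n 4 python3 demo.py"
for four.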
To reiterate, having h5py-mpi available will be transparent to a user
interacting with HDF5 as a serial library. It doesn't break serial use,
it just provides the capability to also run multi-CPU jobs.