
Re: h5py and hdf5-mpi



Hello,

On 12.08.19 18:15, Ghislain Vaillant wrote:
On Mon, 12 Aug 2019 at 17:04, Mo Zhou <lumin@debian.org> wrote:
Hi Drew,

thanks for the commits to h5py.

On 2019-08-12 03:10, Drew Parsons wrote:
We need to change h5py to support hdf5-mpi.  h5py is somewhat crippled
as serial-only.
I didn't even notice that since my use case for hdf5 is light-weight.
(training data is fully loaded from hdf5 into memory)
Same here. My use case for h5py is for storing medical images and raw
data, all of which usually fit into a single workstation.
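For illustration, the serial use case described above is essentially just this pattern (the file and dataset names are made up):

  import h5py

  # Default (serial) driver: open the file and read a dataset fully
  # into memory, as in the training-data / medical-imaging use cases above.
  with h5py.File("training.h5", "r") as f:
      data = f["features"][()]   # [()] loads the whole dataset as a NumPy array

  print(data.shape)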

We could just do it straight away in python3-h5py.  Is there much
point having h5py support both hdf5-serial and hdf5-mpi?  Perhaps
there is, in which case we need to set up multiple builds and use
alternatives to set the preferred h5py.
In fact I don't know. Maybe @ghisvail could answer?
I can't answer this question since I have never used the parallel
builds of HDF5 and h5py.

Are we really sure alternatives are appropriate for this particular use case?

Python has other means of injecting alternative dependencies, such
as PYTHONPATH and virtualenvs.

A related question: is there much point setting up support for
hdf5-mpich as well as hdf5-openmpi? It increases build and
package-alternatives complexity, but once the work is done to
distinguish hdf5-serial from hdf5-mpi, it's not that much more work to
also split hdf5-mpi between hdf5-mpich and hdf5-openmpi.
My personal opinion is to just choose a reasonable default,
unless users shout for something else.
Same here.

We can't cater to every use case in the scientific community, so the
best we can do is choose something sensible based on the data points we
have (if any) and reconsider later in light of user feedback.

Compiling every possible configuration will eventually make
the science team's maintenance burden notorious. h5py is not
like the BLAS64/BLAS flavours, which are clearly needed by some
portion of scientific users.
There is also the question of long-term maintainability. For HPC
builds, people will build their stack from source anyway for maximum
performance on their dedicated hardware. That was the case back when I
used to work for a university. I don't think targeting these users is
worth the trouble compared to research staff who want to prototype or
deploy something quickly on their own workstations or laptops, where
resources are more constrained. That's the background I am coming from
personally, which is why MPI was never considered at the time.

Your mileage may vary of course, and I welcome (and value) your opinions.

Please let me know.

There are a few data formats in bioinformatics now that depend on hdf5,
and h5py is used a lot. My main concern is that the user should not need
to configure anything, like a set of hostnames, and that nothing should
stall while waiting to contact a server. MPI needs to be completely
transparent; on that condition I would very much like to see it.
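As far as I understand it (only a sketch, not a statement about how the
package currently behaves), parallel h5py does not contact hostnames or
servers by itself; the processes come from whatever MPI launcher starts
the script, and the user-visible difference is limited to the file-open
call, roughly:

  from mpi4py import MPI
  import h5py

  # Serial open, as the current python3-h5py supports:
  #   f = h5py.File("data.h5", "r")

  # Parallel open, only possible with an MPI-enabled HDF5/h5py and only
  # meaningful when the script is launched via mpirun/mpiexec:
  f = h5py.File("data.h5", "r", driver="mpio", comm=MPI.COMM_WORLD)
  f.close()

If that is the whole story, the serial API stays untouched and the MPI
side only activates when explicitly requested.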

For packaging I would prefer it all to be as simple as possible, so not
dragging in MPI would be nice, i.e. I would like the -serial package to
be the one that provides hdf5. As long as the two different flavours of
MPI cannot be used in mixed setups, I suggest having hdf5-openmpi, and
also hdf5-mpich if you still have the energy left.

How do autotests work for MPI?
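I don't know how the current autotests are set up, but presumably an MPI
test would have to launch the interpreter through mpirun itself, for
example something like this hypothetical smoke test (the path and the
process count are just an illustration):

  # run as:  mpirun -n 2 python3 test_mpi_write.py
  import os
  from mpi4py import MPI
  import h5py

  comm = MPI.COMM_WORLD
  fname = "/tmp/h5py-mpi-test.h5"   # all ranks must agree on the name

  # Collective parallel write: each rank fills its own element.
  with h5py.File(fname, "w", driver="mpio", comm=comm) as f:
      dset = f.create_dataset("ranks", (comm.size,), dtype="i")
      dset[comm.rank] = comm.rank

  # File close is collective, so every rank has written by this point.
  if comm.rank == 0:
      with h5py.File(fname, "r") as f:
          assert f["ranks"][()].tolist() == list(range(comm.size))
      os.remove(fname)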

Cheers,

Steffen

