
Re: h5py and hdf5-mpi



On 2019-08-13 00:15, Ghislain Vaillant wrote:
On Mon, 12 Aug 2019 at 17:04, Mo Zhou <lumin@debian.org> wrote:


On 2019-08-12 03:10, Drew Parsons wrote:
> We need to change h5py to support hdf5-mpi.  h5py is somewhat crippled
> as serial-only.

I didn't even notice that since my use case for hdf5 is light-weight.
(training data is fully loaded from hdf5 into memory)

Same here. My use case for h5py is for storing medical images and raw
data, all of which usually fit into a single workstation.

Reasonable to keep the hdf5-serial version then.

It sounds like your use-case is post-processing of data. The use-case I have in mind is use of h5py during computation, e.g. supplementing FEniCS jobs (FEniCS itself uses hdf5-mpi in its C++ backend library). That means cluster calculations, for instance, or cloud computing.
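
To illustrate what using h5py during a computation might look like, here is a minimal sketch of a collective write. It assumes mpi4py is installed and that h5py was built against hdf5-mpi. The helper name rows_for_rank and the file and dataset names are made up for the example; the driver='mpio' and comm= arguments are the ones h5py documents for parallel HDF5.

```python
def rows_for_rank(n_rows, rank, size):
    """Return the half-open (start, stop) row range owned by one MPI rank.

    Rows are split as evenly as possible; the first n_rows % size ranks
    each get one extra row.
    """
    base, extra = divmod(n_rows, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def write_parallel(path, n_rows):
    """Each rank writes its own slice of one shared dataset.

    Assumption: mpi4py plus an MPI-enabled h5py. With a serial-only h5py
    the 'mpio' driver is not available and opening the file fails.
    """
    # Imported here so the slicing helper stays usable without MPI installed.
    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    with h5py.File(path, "w", driver="mpio", comm=comm) as f:
        # Dataset creation is collective: every rank must make this call.
        dset = f.create_dataset("values", (n_rows,), dtype="f8")
        start, stop = rows_for_rank(n_rows, comm.rank, comm.size)
        dset[start:stop] = float(comm.rank)


# Hypothetical invocation, one process per rank:
#   mpirun -n 4 python3 -c "import demo; demo.write_parallel('out.h5', 10)"
```

With a serial-only h5py each rank would need its own file, or all data would have to be gathered on one rank before writing.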


> We could just do it straight away in python3-h5py.  Is there much
> point having h5py support both hdf5-serial and hdf5-mpi?  Perhaps
> there is, in which case we need to set up multiple builds and use
> alternatives to set the preferred h5py.
...
Are we really sure alternatives are appropriate for this particular use case?

Python has got other means for injecting alternative dependencies such
as PYTHONPATH and virtualenvs.

PYTHONPATH is a solution for individual users to override the system default, but it's policy not to rely on env variables for the system installation. I'm not familiar with virtualenvs. I gather it's also a user mechanism to override the default configuration.
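
For reference, a small sketch of the per-user PYTHONPATH override. It builds two throwaway stub packages in a temporary directory and shows that the entry listed first in PYTHONPATH is the one a fresh interpreter imports. The package name fake_h5py and the directory names are invented for the demonstration and have nothing to do with the real packaging.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def make_stub(root, variant):
    """Create root/<variant>/fake_h5py/__init__.py tagged with its variant."""
    pkg = Path(root) / variant / "fake_h5py"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(f"VARIANT = {variant!r}\n")
    return str(pkg.parent)  # the directory to put on PYTHONPATH


def variant_seen_with(pythonpath):
    """Start a fresh interpreter with PYTHONPATH set; report what it imports."""
    env = dict(os.environ, PYTHONPATH=pythonpath)
    result = subprocess.run(
        [sys.executable, "-c", "import fake_h5py; print(fake_h5py.VARIANT)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as root:
    serial_dir = make_stub(root, "serial")
    mpi_dir = make_stub(root, "mpi")
    # When two entries provide the same package name, the earlier one wins.
    first_serial = variant_seen_with(os.pathsep.join([serial_dir, mpi_dir]))
    first_mpi = variant_seen_with(os.pathsep.join([mpi_dir, serial_dir]))

print(first_serial, first_mpi)
```

This only affects processes that inherit the variable, which is why it suits individual users rather than the system default.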

So the question is, which h5py (which hdf5) 'import h5py' should be working with by default. For cloud computing installations, hdf5-mpi makes sense. Even for workstations, most are multi-cpu these days.
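
Whichever build ends up as the default, a user can check what 'import h5py' resolved to. A minimal sketch, assuming only that h5py may or may not be installed; the function name h5py_build is made up, while get_config().mpi is the flag h5py documents for an MPI-enabled build.

```python
def h5py_build():
    """Report which kind of h5py 'import h5py' resolves to."""
    try:
        import h5py
    except ImportError:
        return "not installed"
    # get_config().mpi is True when h5py was built against parallel HDF5.
    return "mpi" if h5py.get_config().mpi else "serial"


print(h5py_build())
```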

It's not so hard to set up alternatives links to point /usr/lib/python3/dist-packages/h5py at h5py-serial or h5py-mpi. I've done it for the real and complex variants of petsc4py. For additional entertainment h5py-serial and h5py-mpi could be installed alongside each other in the normal python modules directory, which means one could consider 'import h5py-serial' or 'import h5py-mpi' directly. I think that operates with the effect of 'import h5py as h5py-serial' but might not be robust. A robust approach would place the h5py-serial and h5py-mpi directories elsewhere, where a user's PYTHONPATH could specify them independently of the default.
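
The link-switching idea can be sketched in a few lines. This is a toy model in a temporary directory, assuming a POSIX filesystem with symlinks; it is not the update-alternatives tool itself, and the function name point_default_at and the stub file contents are invented for the illustration. Note that a hyphenated name cannot appear in a Python import statement, so in this sketch the variant directories are only ever reached through the h5py link.

```python
import os
import tempfile
from pathlib import Path


def point_default_at(link, target):
    """Repoint the symlink 'link' at directory 'target' in one rename step."""
    tmp = Path(str(link) + ".tmp")
    if tmp.is_symlink():
        tmp.unlink()
    os.symlink(target, tmp, target_is_directory=True)
    os.replace(tmp, link)  # replaces the old link itself, not its target


with tempfile.TemporaryDirectory() as root:
    site = Path(root)  # stand-in for a dist-packages directory
    for variant in ("h5py-serial", "h5py-mpi"):
        (site / variant).mkdir()
        (site / variant / "__init__.py").write_text(f"VARIANT = {variant!r}\n")

    default = site / "h5py"
    point_default_at(default, site / "h5py-serial")
    seen_first = (default / "__init__.py").read_text()

    point_default_at(default, site / "h5py-mpi")
    seen_second = (default / "__init__.py").read_text()

print(seen_first.strip())
print(seen_second.strip())
```

In a real package the link would be registered and switched by the alternatives system rather than by hand.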


> A related question, is there much point setting up support for
> hdf5-mpich as well as hdf5-openmpi?  Increasing build and
> package-alternatives complexity, but once it's done once to
> distinguish hdf5-serial from hdf5-mpi, it's not that much more work to
> also split hdf5-mpi between hdf5-mpich and hdf5-openmpi.

My personal opinion is to just choose a reasonable default,
unless users shouted for that.

Same here.

We can't cater to every use case in the scientific community, so the
best we can do is choose something sensible with the data points we
have got (if any) and later reconsider with user feedback.

True, supporting the alternative mpi is not our highest priority. Though I often find our upstream developers cursing at openmpi. They do that every 2 months or so in different upstream projects. We can consider mpich a "wishlist" issue. As you point out it takes more resources to support, and our time is limited.

Drew

