Bug#1092662: rocsolver: improve build parallelism
Hi Aron,
On 2025-01-10 06:57, Aron Xu wrote:
> When I was performing a rebuild for an upcoming transition, I noticed
> that rocsolver took a lot of time because most of the time the build
> takes up to 16 parallel jobs. It would be great if the build
> parallelism could be improved but I have not done research on what's
> the cause, it might be build system related and not easy to change.

The slowest translation units are those containing the specialized
kernels for small matrix sizes. These kernels are templated on an
integer parameter, N, representing the size of the matrix. They are
explicitly instantiated largely for loop unrolling.
The compiler builds these templated functions for all possible
combinations of:
  N = 0...64
  data type = single, double, complex, double complex
  gpu_arch = gfx803, gfx900, gfx906, gfx908, gfx90a, gfx1010, gfx1030,
             gfx1100, gfx1101, gfx1102
Incidentally, these kernels represent ~95% of the size on disk of
librocsolver.so.
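
To give a sense of scale, here is a rough count of the kernel variants
implied by the ranges above. It is only a back-of-the-envelope sketch: it
takes the lists exactly as written in this message and counts the variants
of a single templated function, not the real number of symbols in the
library.

```python
from itertools import product

# Parameter ranges as listed in this message.
sizes = range(0, 65)  # N = 0...64 inclusive, i.e. 65 values
data_types = ["single", "double", "complex", "double complex"]
gpu_archs = ["gfx803", "gfx900", "gfx906", "gfx908", "gfx90a",
             "gfx1010", "gfx1030", "gfx1100", "gfx1101", "gfx1102"]

# Each (N, data type, architecture) triple is one compiled kernel variant.
variants = list(product(sizes, data_types, gpu_archs))

print(len(variants))                     # 2600 variants per templated function
print(len(variants) // len(data_types))  # 650 of those per data type
```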
If I recall correctly, the parallelism in this process is limited to the
data_type. Separate translation units were created for different data
types to increase the possible parallelism. As an upstream developer of
rocSOLVER, there are a few ways in which I would like to see this improved:
1. Support for a HIP equivalent to CUDA_SEPARABLE_COMPILATION within
CMake [1]. This would enable the build system to manage the invocation
of the compiler for each GPU architecture. As it stands now, when
building for multiple GPU architectures, clang invokes itself multiple
times in serial and then invokes the bundler to combine the resulting
artifacts. If this were managed by the build system, you could have up
to a 10x increase in parallelism (one job per GPU architecture).
It should be noted that the AMD fork of clang has a flag called
-parallel-jobs that allows clang to invoke itself in parallel when
building for multiple architectures. Unfortunately, this is a flawed
solution. The clang job count is multiplicative with the make job count
and this can result in resource exhaustion in the parts of the build
with the greatest make-managed parallelism. As such, you're forced to
set -parallel-jobs to a relatively low value, which needlessly limits
parallelism during the parts of the build with the least make-managed
parallelism.
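
The trade-off can be illustrated with a simple model. This is a sketch
under assumptions of my own, not a measurement: the function name is made
up for illustration, the job counts are example values, and it assumes
each busy make job runs one clang invocation that fans out to at most
-parallel-jobs device compilations.

```python
def concurrent_compiles(busy_make_jobs: int, parallel_jobs: int, gpu_archs: int) -> int:
    """Concurrent device compilations in a simplified model (hypothetical helper).

    Each busy make job runs one clang invocation, and each invocation
    compiles at most `parallel_jobs` architectures at a time.
    """
    per_invocation = min(parallel_jobs, gpu_archs)
    return busy_make_jobs * per_invocation


# Wide phase: 16 make jobs all busy, 10 architectures.
print(concurrent_compiles(16, 10, 10))  # 160 processes, likely exhausting memory
print(concurrent_compiles(16, 2, 10))   # 32 processes, more manageable

# Narrow phase: only 4 translation units left (one per data type).
print(concurrent_compiles(4, 2, 10))    # 8 processes, so a 16-core machine sits half idle
```

A value that is safe for the wide phase leaves cores unused in the narrow
phase, which is the problem described above.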
2. If the small matrix size functions in rocsolver could be rewritten to
depend on kernels that operated on blocks of fixed sizes, then perhaps
the kernels could be instantiated for something like N=1,2,4,8,16,32,
then N=1...64 could just use a combination of those other sizes.
Unfortunately, previous attempts to do this failed because they
introduced unacceptable performance regressions.
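
As an illustration of the idea only, the following sketch shows how any
size from 1 to 64 can be expressed as a sum of the six block sizes. The
greedy split and the function name are my own assumptions for the example;
it says nothing about how the kernels themselves would combine the blocks,
which is where the performance problems arose.

```python
# Block sizes that would be explicitly instantiated, largest first.
BLOCK_SIZES = (32, 16, 8, 4, 2, 1)


def decompose(n: int) -> list[int]:
    """Split a matrix size n >= 1 into a greedy sum of fixed block sizes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    blocks = []
    for size in BLOCK_SIZES:
        # Take as many blocks of this size as still fit.
        while n >= size:
            blocks.append(size)
            n -= size
    return blocks


print(decompose(37))  # [32, 4, 1]
print(decompose(64))  # [32, 32]
```

With this scheme there would be 6 instantiations per data type and
architecture instead of one for every N.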
3. The specialized small matrix kernels could be split out of
librocsolver.so and into separate code objects. The rocsolver library
could then manage the build of those code objects itself within its
CMake, which would allow for parallel compilation by GPU architecture.
This option might also be nice because the rocsolver library itself
would be ~95% smaller if it only contained the generic kernels, and the
size-specialized kernels were moved to separate files to be loaded at
runtime (if available).
4. The Debian build could ask CMake to generate Ninja build files rather
than Make build files. If built with Ninja, the librocsolver,
rocsolver-test and rocsolver-bench sources would be compiled in parallel
despite the latter two depending on the former. This would reduce the
number of parallelism bottlenecks, but it may result in the librocsolver
library being linked during the compilation of other sources, which
would increase the maximum amount of memory required for the build.
There is, however, a patch that could be used as a workaround [2].
5. Once LLVM's generic targets and SPIR-V targets are supported by the
HIP Runtime, we could adopt them to reduce the number of GPU targets we
need to build for. This doesn't actually increase parallelism, but it
would at least reduce the build time.
Sincerely,
Cory Bloor
[1]: https://gitlab.kitware.com/cmake/cmake/-/issues/23210
[2]: https://github.com/ROCm/rocSOLVER/pull/652