Bug#1092662: rocsolver: improve build parallelism

To: Aron Xu <aron@debian.org>, 1092662@bugs.debian.org
Subject: Bug#1092662: rocsolver: improve build parallelism
From: Cordell Bloor <cgmb@slerp.xyz>
Date: Sat, 11 Jan 2025 12:00:59 -0700
Message-id: <d65590c1-69f1-473c-8ae0-b065738f40b2@slerp.xyz>
Reply-to: Cordell Bloor <cgmb@slerp.xyz>, 1092662@bugs.debian.org
In-reply-to: <CAMr=8w5-11beytodArN5n6+u6ht0-QB9HeSLwA4F2EdW0ipGFA@mail.gmail.com>
References: <CAMr=8w5-11beytodArN5n6+u6ht0-QB9HeSLwA4F2EdW0ipGFA@mail.gmail.com>

Hi Aron,

On 2025-01-10 06:57, Aron Xu wrote:

> When I was performing a rebuild for an upcoming transition, I noticed
> that rocsolver took a lot of time because most of the time the build
> takes up to 16 parallel jobs. It would be great if the build
> parallelism could be improved but I have not done research on what's
> the cause, it might be build system related and not easy to change.

The slowest translation units are those containing the specialized kernels for small matrix sizes. These kernels are templated on an integer parameter, N, representing the size of the matrix. They are explicitly instantiated largely for loop unrolling.

The compiler builds these templated functions for all possible combinations of N=0...64, data type=single,double,complex,double complex, gpu_arch=gfx803,gfx900,gfx906,gfx908,gfx90a,gfx1010,gfx1030,gfx1100,gfx1101,gfx1102. Incidentally, these kernels represent ~95% of the size on disk of librocsolver.so.
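
As a rough illustration of the scale, the combinations listed above can be enumerated in a few lines of Python. This is only a sketch: it assumes a single kernel template, whereas the real library has several specialized kernels, so the true count is a multiple of this figure. The variable names are made up for the example.

```python
from itertools import product

# Parameter values as listed in the paragraph above.
sizes = range(0, 65)  # N = 0...64 inclusive
data_types = ["single", "double", "complex", "double complex"]
gpu_archs = ["gfx803", "gfx900", "gfx906", "gfx908", "gfx90a",
             "gfx1010", "gfx1030", "gfx1100", "gfx1101", "gfx1102"]

# One instantiation per (N, data type, GPU architecture) combination,
# assuming a single hypothetical kernel template.
instantiations = list(product(sizes, data_types, gpu_archs))

print(len(instantiations))  # 65 * 4 * 10 = 2600
```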

If I recall correctly, the parallelism in this process is limited to the data_type. Separate translation units were created for different data types to increase the possible parallelism. As an upstream developer of rocSOLVER, there are a few ways in which I would like to see this improved:

1. Support for a HIP equivalent to CUDA_SEPARABLE_COMPILATION within CMake [1]. This would enable the build system to manage the invocation of the compiler for each GPU architecture. As it stands now, when building for multiple GPU architectures, clang invokes itself multiple times in serial and then invokes the bundler to combine the resulting artifacts. If this were managed by the build system, you could have a 10x increase in parallelism.

It should be noted that the AMD fork of clang has a flag called -parallel-jobs that allows clang to invoke itself in parallel when building for multiple architectures. Unfortunately, this is a flawed solution. The clang job count is multiplicative with the make job count and this can result in resource exhaustion in the parts of the build with the greatest make-managed parallelism. As such, you're forced to set -parallel-jobs to a relatively low value, which needlessly limits parallelism during the parts of the build with the least make-managed parallelism.

2. If the small matrix size functions in rocsolver could be rewritten to depend on kernels that operated on blocks of fixed sizes, then perhaps the kernels could be instantiated for something like N=1,2,4,8,16,32, then N=1...64 could just use a combination of those other sizes. Unfortunately, previous attempts to do this failed because they introduced unacceptable performance regressions.

3. The specialized small matrix kernels could be split out of librocsolver.so and into separate code objects. The rocsolver library could then manage the build of those code objects itself within its CMake, which would allow for parallel compilation by GPU architecture. This option might also be nice because the rocsolver library itself would be ~95% smaller if it only contained the generic kernels, and the size-specialized kernels were moved to separate files to be loaded at runtime (if available).

4. The Debian build could ask CMake to generate Ninja build files rather than Make build files. If built with Ninja, the librocsolver, rocsolver-test and rocsolver-bench sources would be compiled in parallel despite the latter depending on the former. This would reduce the number of parallelism bottlenecks, but it may result in the librocsolver library being linked during the compilation of other sources, which would increase the maximum amount of memory required for the build. There is, however, a patch that could be used as a workaround [2].

5. Once LLVM's generic targets and SPIR-V targets are supported by the HIP Runtime, we could adopt them to reduce the number of GPU targets we need to build for. This doesn't actually increase parallelism, but it would at least reduce the build time.
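
Two of the points above come down to simple arithmetic, sketched here in Python. The first function shows why the clang job count in point 1 is a problem: the worst-case number of compiler processes is the product of the two job counts. The second shows how a size in N=1...64 could be covered by the fixed block sizes suggested in point 2. The function names and the example job counts are hypothetical, and the decomposition only illustrates the size bookkeeping, not how rocSOLVER kernels would actually be combined.

```python
# Fixed block sizes suggested in point 2, largest first.
BLOCK_SIZES = (32, 16, 8, 4, 2, 1)


def worst_case_compiler_processes(make_jobs: int, clang_parallel_jobs: int) -> int:
    """Upper bound on simultaneous compiler processes (point 1).

    Each make job may start a clang that itself spawns up to
    clang_parallel_jobs processes, so the two counts multiply.
    """
    return make_jobs * clang_parallel_jobs


def split_into_blocks(n: int) -> list[int]:
    """Greedily cover a matrix size n with the fixed block sizes (point 2)."""
    blocks = []
    for size in BLOCK_SIZES:
        while n >= size:
            blocks.append(size)
            n -= size
    return blocks


# Hypothetical values: make -j16 combined with -parallel-jobs=4.
print(worst_case_compiler_processes(16, 4))  # 64

print(split_into_blocks(37))  # [32, 4, 1]
print(split_into_blocks(64))  # [32, 32]
```

Note that 64 needs the size-32 block twice, so any such scheme would have to allow a block size to be reused.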


Sincerely,
Cory Bloor

[1]: https://gitlab.kitware.com/cmake/cmake/-/issues/23210
[2]: https://github.com/ROCm/rocSOLVER/pull/652
