Hi Clément,

On 2025-07-28 07:58, LONGEAC Clement wrote:
> But I have several problems with it: builds take a very long time, so some tests are marked as timed out whatever I do. I set the time limit to 42 200 seconds. In the ROCm parts, I get the error "Maximum valid workgroup size 256 on device <pyopencl.Device 'gfx1034' on 'AMD Accelerated Parallel Processing' at 0xe90bf90> 0.0 1.871411379818157e-05". I don't know how to solve it or where it comes from. I have done a lot of research and still don't know how to fix it. It seems to be a hardware limitation: to solve it, one would need an AMD GPU marked as PRO, not a gaming graphics card.
I don't think this is related to your GPU. I believe the problem is that your code is trying to launch too many threads within a workgroup when it runs a kernel.
The maximum number of threads per workgroup (or, in CUDA terminology, threads per block) that a kernel can be launched with is determined at compile time. It is 256 by default, but it can be raised to 1024 by passing --gpu-max-threads-per-block=1024 as a compile flag (or by annotating the specific kernel directly). The trade-off is that increasing the threads per block leaves fewer registers available to each thread, which may reduce performance.
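As a sketch of the workaround on the host side (my own illustration, not code from your project): if you can't or don't want to recompile with a higher limit, you can clamp the local work size you request to the limit the device reports. With pyopencl, the per-kernel limit can be queried via kernel.get_work_group_info(pyopencl.kernel_work_group_info.WORK_GROUP_SIZE, device); the helper below just shows the clamping logic.

```python
# Hypothetical helper (not from the original thread): pick a local
# (workgroup) size that respects the device/kernel limit.  In real
# pyopencl code, max_work_group_size would come from
# kernel.get_work_group_info(
#     pyopencl.kernel_work_group_info.WORK_GROUP_SIZE, device).

def clamp_local_size(preferred: int, max_work_group_size: int) -> int:
    """Largest power-of-two local size that is <= both the preferred
    size and the reported limit."""
    size = 1
    while size * 2 <= min(preferred, max_work_group_size):
        size *= 2
    return size

# A kernel written for 1024 threads per workgroup, on a build where
# the compiled limit is the ROCm default of 256:
print(clamp_local_size(1024, 256))  # 256
```

Note this only helps if the kernel is correct for smaller workgroups; a kernel that hard-codes 1024 threads of work per group still needs the compile flag (or a code fix).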
A plausible alternative is that the code is trying to launch too many workgroups/blocks in each kernel launch, rather than just too many threads per workgroup. If that is the case, I don't know of any solution aside from fixing the code. I believe there's a limit of 256 blocks per kernel launch, as well.
Sincerely,
Cory Bloor