Re: Helping with ROCm packaging
On Sat, Nov 23, 2024 at 07:52:10AM -0700, Cordell Bloor wrote:
> To contribute to the migration to ROCm 6.1, I would suggest (in no
> particular order):
I may have misread the LLVM discussion about the 6.1/6.2 situation in
the other thread. But no matter, I'm available.
> 2. Update rocthrust and hipcub from ROCm 5.7 to ROCm 6.1 or later.
I'll start with these.
> To contribute to the packaging of pytorch-rocm, I would suggest:
> 1. Complete the packaging of hipblaslt (or summarize the remaining work so I
> can pick it up).
Ooh, this one again. Well, I started it and if it's still up in the
air I can get back to it. But...
> > I have gfx1032 and gfx1036 available (not expecting everyone to
> > remember about my Dimgrey Cavefish).
>
> There is also a gfx90a system hosted by the Oregon Advanced Computing
> Institute for Science and Society that is available to you. Feel free to
> reach out to me privately if you need access (e.g., for testing hipblaslt)
> and do not yet have an account.
... Yeah, the lack of HW was where I got stuck last time. I don't have
access to their system yet. I've CC'd you.
> I'm also a bit curious if anything works on gfx1036 on Linux 6.10 or later.
> If you want to run the tests from rocrand1-tests or hipsolver0-tests, I'd be
> curious to see the results.
It's not 100%, I guess, but it looks much better than "if anything
works" to me.
$ lscpu |grep name
Model name: AMD Ryzen 9 7900X 12-Core Processor
$ uname -a
Linux sammakko4 6.11.6-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.11.6-1 (2024-11-04) x86_64 GNU/Linux
No problems with hipsolver.
[==========] 9408 tests from 80 test suites ran. (29513 ms total)
[ PASSED ] 9408 tests.
hipSOLVER version 1.7.0.
However, rocrand had some issues (edited log, failures only):
test/test_rocrand_kernel_lfsr113.cpp:342: Failure
Expected: (v) > (0.0f), actual: 0 vs 0
[ FAILED ] rocrand_kernel_lfsr113.rocrand_uniform_range (195 ms)
test/test_rocrand_kernel_lfsr113.cpp:373: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[ FAILED ] rocrand_kernel_lfsr113.rocrand_uniform_double_range (223 ms)
test/test_rocrand_kernel_mrg.cpp:367: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[ FAILED ] rocrand_kernel_mrg/0.rocrand_uniform_double_range, where TypeParam = rocrand_device::mrg31k3p_engine (236 ms)
test/test_rocrand_kernel_mrg.cpp:367: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[ FAILED ] rocrand_kernel_mrg/1.rocrand_uniform_double_range, where TypeParam = rocrand_device::mrg32k3a_engine (246 ms)
test/test_rocrand_kernel_threefry4x32_20.cpp:357: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[ FAILED ] rocrand_kernel_threefry4x32_20.rocrand_uniform_double_range (223 ms)
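A failures-only view like the one above could be produced with a small
filter along these lines. This is only a sketch, assuming standard
GoogleTest console output; the function name failures_only is
hypothetical, and it is not necessarily how the log above was edited.

```python
import re

# Matches GoogleTest assertion-location lines such as
# "test/test_rocrand_kernel_lfsr113.cpp:342: Failure".
FAILURE_LOCATION = re.compile(r"^\S+:\d+: Failure$")
# Matches the per-test "[  FAILED  ] ..." marker regardless of padding.
FAILED_MARKER = re.compile(r"^\[\s*FAILED\s*\]")


def failures_only(log_text):
    """Keep each assertion failure through its '[ FAILED ]' line; drop the rest."""
    kept = []
    in_failure = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if FAILURE_LOCATION.match(stripped):
            # Start (or continue) a failure block.
            in_failure = True
            kept.append(stripped)
        elif in_failure:
            kept.append(stripped)
            if FAILED_MARKER.match(stripped):
                # The per-test FAILED line closes the block.
                in_failure = False
    return kept
```

Passing tests and the final summary are dropped because they are never
preceded by a "file:line: Failure" line.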
I was half expecting to see malloc errors but these were something
else. Or maybe they were malloc errors in disguise for all I know.
This should be a FAQ, but what was the trick to lift the 0.5 GB limit?
I think it's been mentioned on the list but I couldn't find it again.
I used "export HIP_VISIBLE_DEVICES=1" and verified highly
scientifically with "radeontop -b 11" that the right device's bars
bopped when I ran the tests.
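The same device selection can be scoped to a single command instead of
the whole shell. A minimal sketch, assuming only that the child process
reads HIP_VISIBLE_DEVICES from its environment; the helper name
run_on_device is hypothetical, and the child here is a stand-in that
echoes the variable rather than a real test binary.

```python
import os
import subprocess
import sys


def run_on_device(command, device_index):
    """Run command with HIP_VISIBLE_DEVICES set to a single device index."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["HIP_VISIBLE_DEVICES"] = str(device_index)
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child: prints the variable it was given. A real run would
    # pass the path of a test binary instead (depends on the install).
    child = [sys.executable, "-c",
             "import os; print(os.environ['HIP_VISIBLE_DEVICES'])"]
    result = run_on_device(child, 1)
    print(result.stdout.strip())
```

This keeps the setting from leaking into later commands in the same
session, which an exported variable would do.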