
Re: Helping with ROCm packaging



On Sat, Nov 23, 2024 at 07:52:10AM -0700, Cordell Bloor wrote:
> To contribute to the migration to ROCm 6.1, I would suggest (in no
> particular order):

I may have misread the LLVM discussion about the 6.1/6.2 situation in
the other thread.  But no matter, I'm available.

> 2. Update rocthrust and hipcub from ROCm 5.7 to ROCm 6.1 or later.

I'll start with these.

> To contribute to the packaging of pytorch-rocm, I would suggest:
> 1. Complete the packaging of hipblaslt (or summarize the remaining work so I
> can pick it up).

Ooh, this one again.  Well, I started it and if it's still up in the
air I can get back to it.  But...

> > I have gfx1032 and gfx1036 available (not expecting everyone to
> > remember about my Dimgrey Cavefish).
> 
> There is also a gfx90a system hosted by the Oregon Advanced Computing
> Institute for Science and Society that is available to you. Feel free to
> reach out to me privately if you need access (e.g., for testing hipblaslt)
> and do not yet have an account.

... Yeah, the lack of HW was where I got stuck last time.  I've no
access to their system yet.  I CCd you.

> I'm also a bit curious if anything works on gfx1036 on Linux 6.10 or later.
> If you want to run the tests from rocrand1-tests or hipsolver0-tests, I'd be
> curious to see the results.

It's not 100%, I guess, but it looks much better than "if anything
works" to me.

$ lscpu  |grep name
Model name:                           AMD Ryzen 9 7900X 12-Core Processor
$ uname -a
Linux sammakko4 6.11.6-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.11.6-1 (2024-11-04) x86_64 GNU/Linux

No problems with hipsolver.

[==========] 9408 tests from 80 test suites ran. (29513 ms total)
[  PASSED  ] 9408 tests.
hipSOLVER version 1.7.0.

However, rocrand had some issues (edited log, failures only):
test/test_rocrand_kernel_lfsr113.cpp:342: Failure
Expected: (v) > (0.0f), actual: 0 vs 0
[  FAILED  ] rocrand_kernel_lfsr113.rocrand_uniform_range (195 ms)
test/test_rocrand_kernel_lfsr113.cpp:373: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[  FAILED  ] rocrand_kernel_lfsr113.rocrand_uniform_double_range (223 ms)

test/test_rocrand_kernel_mrg.cpp:367: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[  FAILED  ] rocrand_kernel_mrg/0.rocrand_uniform_double_range, where TypeParam = rocrand_device::mrg31k3p_engine (236 ms)
test/test_rocrand_kernel_mrg.cpp:367: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[  FAILED  ] rocrand_kernel_mrg/1.rocrand_uniform_double_range, where TypeParam = rocrand_device::mrg32k3a_engine (246 ms)

test/test_rocrand_kernel_threefry4x32_20.cpp:357: Failure
Expected: (v) > (0.0), actual: 0 vs 0
[  FAILED  ] rocrand_kernel_threefry4x32_20.rocrand_uniform_double_range (223 ms)
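
For anyone wanting to reproduce a failures-only excerpt like the one
above, here is a minimal sketch.  It assumes the full GoogleTest output
was saved to a file and follows the usual gtest layout (a
"file:line: Failure" line, detail lines, then a "[  FAILED  ]" line).
The function name gtest_failures is made up for illustration.

```shell
# Print only the failure records from a saved GoogleTest log.
# Assumption: each record starts at a line ending in ": Failure" and
# ends at the next line starting with "[  FAILED  ] ".
# gtest_failures is a hypothetical helper name.
gtest_failures() {
    awk '
        /: Failure$/       { keep = 1 }   # start of a failure record
        keep               { print }      # print everything inside a record
        /^\[  FAILED  \] / { keep = 0 }   # the FAILED line closes the record
    ' "$1"
}
```

The summary list that gtest prints at the very end is skipped, because
those "[  FAILED  ]" lines are not preceded by a "Failure" line.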

I was half expecting to see malloc errors but these were something
else.  Or maybe they were malloc errors in disguise for all I know.
This should be a FAQ, but what was the trick to lift the 0.5GB limit?
I think it's been mentioned on the list but I couldn't find it again.

I used "export HIP_VISIBLE_DEVICES=1" and verified highly
scientifically with "radeontop -b 11" that the right device's bars
bopped when I ran the tests.
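
As an alternative to exporting the variable for the whole shell
session, the device selection can be scoped to a single command.  A
small sketch, where run_on_device is a made-up helper name and only
HIP_VISIBLE_DEVICES itself comes from the setup described above:

```shell
# Run one command with only the given HIP device index visible to it.
# run_on_device is a hypothetical helper; the variable is set for the
# child process only, so the calling shell's environment is unchanged.
run_on_device() {
    dev="$1"
    shift
    env HIP_VISIBLE_DEVICES="$dev" "$@"
}
```

For example, run_on_device 1 "$TEST_BINARY" --gtest_filter='*uniform*range*'
would rerun just the range tests on device 1.  TEST_BINARY is a
placeholder for whichever test executable is being run; --gtest_filter
is the standard GoogleTest option.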

