Hi Kari,
What's the exact condition for a package to be picked up by the CI? I saw that haskell-futhark showed up on it even before I had any autopkgtest files defined. I'm thinking of packaging futhark-benchmarks next and having them run as well, and I'd like to know what I'd need to do in debian/control to get things rolling. Would a Recommends on futhark alone do it, via some transitive magic? futhark-benchmarks (not yet even ITP'd) would be a bunch of Futhark source files to be placed under /usr/src/.
Christian is the expert on this, but it shouldn't be too hard to look up how it works. This is controlled by the debci-scheduler [1], using a configuration file that is stored in rocm-team-infra [2]. The debci-scheduler documentation states:
> To activate scheduling, an administrator creates a
> /etc/debci/scheduler.conf file [...]. It contains general
> configuration directives and a list of packages. This list of
> packages is called the "Wanted List". [....] Jobs are scheduled
> for all reverse dependencies of all binary packages of a
> triggered Wanted package.
The lookup of the reverse dependencies is implemented using
python3-apt [3].
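To make that rule a bit more concrete, here is a toy model of it. This is only a sketch of my reading of the documentation, not the debci-scheduler code: it uses a hand-written table instead of the apt cache, every package name in it is made up, and it only follows direct reverse dependencies. Whether the real scheduler also follows Recommends or transitive relationships is not something this sketch answers.

```python
from collections import defaultdict

# Hypothetical data standing in for the apt cache:
# binary package -> binary packages it depends on.
DEPENDS = {
    "app-a": ["libwanted1"],
    "app-b": ["libwanted-dev", "libother2"],
    "app-c": ["libother2"],
}

# Hypothetical Wanted source package -> binary packages it builds.
SOURCE_BINARIES = {"wanted-src": ["libwanted1", "libwanted-dev"]}


def reverse_dependencies(depends):
    """Invert the dependency table: package -> set of packages depending on it."""
    rdeps = defaultdict(set)
    for pkg, deps in depends.items():
        for dep in deps:
            rdeps[dep].add(pkg)
    return rdeps


def jobs_for(wanted_source, source_binaries, depends):
    """List packages to test when wanted_source is triggered.

    Collects the direct reverse dependencies of every binary package
    built from the Wanted source package.
    """
    rdeps = reverse_dependencies(depends)
    jobs = set()
    for binary in source_binaries.get(wanted_source, []):
        jobs |= rdeps.get(binary, set())
    return sorted(jobs)
```

In this model, a package gets scheduled as soon as one of its binary packages depends on a binary built from a Wanted source package, whether or not it ships any tests of its own.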
If you're curious about it, the results are at https://ci.rocm.debian.net/packages/h/haskell-futhark/. I'm not worrying about them failing for now; I'll refine the tests with later uploads. I have at least one command line flag to try that upstream suggested for them. The important part is that the HIP tests are succeeding on at least one architecture, as with https://ci.rocm.debian.net/packages/h/haskell-futhark/unstable/amd64+gfx1032/39566/ Looks like Futhark's tests are good at stress testing the drivers and HSA layer. It has a lot of small tests that a GPU should have no trouble running in parallel with little memory use. For example, in https://ci.rocm.debian.net/packages/h/haskell-futhark/unstable/amd64+gfx1035/39659/ one test got an error like "Memory access fault by GPU node-1 (Agent handle: 0x55b623418c20) on address 0x7fa60a57a000. Reason: Page not present or supervisor privilege."
While it's possible that is an error in the ROCm libraries or driver, this appears to be an out-of-bounds write. It's the sort of error that you'd get if you wrote code that didn't check for allocation failure. A common pattern would be hipMalloc failing (e.g., due to out-of-memory), but the code not checking the return value and using the returned pointer as if the allocation succeeded.
The ROCm components are certainly not flawless. In fact, it's not that difficult to find ways to overwhelm them. That is especially true for hardware that's not officially supported by AMD for use with ROCm. I'd just also be on the lookout for a mistake in the error handling of the calling application.
And I had a GPU hang with https://ci.rocm.debian.net/packages/h/haskell-futhark/unstable/amd64+gfx1011/39621/ [....] Is there some way to define a custom timeout for the CI run? The gfx1011 test I linked above took 9 hours and this is embarrassing. Even 2 hours maximum would be excessive for these under any circumstances.
Yes. Christian recently implemented a '--timeout-test-nogprogress' option for autopkgtest, so we can stop the test after no new output has been received for a few minutes [5]. I'll enable that on my gfx1011 test system.
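The idea behind a no-progress timeout can be sketched as a small watchdog. This is an illustration of the general technique only, not autopkgtest's implementation; the function name and the idle_limit parameter are made up for the example.

```python
import subprocess
import threading
import time


def run_with_no_progress_timeout(cmd, idle_limit):
    """Run cmd and kill it if it prints nothing for idle_limit seconds.

    Returns (exit_code, timed_out, output_lines).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]  # one-element list so the thread can update it
    lines = []

    def reader():
        # Every line of output resets the idle clock.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            lines.append(line.rstrip("\n"))

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > idle_limit:
            timed_out = True
            proc.kill()
            break
        time.sleep(0.05)

    proc.wait()
    thread.join(timeout=5)  # let the reader drain any remaining output
    return proc.returncode, timed_out, lines
```

Unlike a fixed wall-clock limit, this lets a slow but chatty test run for hours while a hung one is stopped soon after it goes silent.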
I need the 9 hour timeout because rocfft sometimes takes that long. The rocfft tests run slowly in podman for reasons that are not entirely clear to me. I suspect it's because rocfft does more IPC and file IO than most other libraries due to its use of HIP RTC. There's probably something in the podman security model that is introducing CPU overhead for that workload.
Currently, I have enabled three backends for Futhark's tests: multicore (CPU only), OpenCL+POCL (CPU only) and HIP. The CPU only tests are valid as such but I find it doubtful how useful running them on these machines is. I think I could make them skippable and do so on a ROCm CI environment. Is there a way to detect that it's running on one? Simply reversing the /dev/kfd check seems wrong to me.
Interesting point. I don't think we have a good mechanism for this yet.
Any suggestions on how to locally test autopkgtest scripts? I tried it with an sbuild setup and that didn't have HSA available in it with no relevant dev files defined.
I'm not sure how to do it with a local package version, but you
should probably use the podman+rocm autopkgtest executor from
pkg-rocm-tools.
I copied over some artifact gathering and the /dev/kfd skip test from other HIP tests but I'm not liking this code duplication. Could we put it in /usr/share/rocm/autopkgtest/?
Christian consolidated this functionality in rocm-test-launcher, which is a part of the pkg-rocm-tools package currently in NEW [4].
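For reference, the kind of check being duplicated is small. The following is a minimal sketch of a /dev/kfd skip test, not the rocm-test-launcher code; the function names are made up, and it assumes the test is declared with "Restrictions: skippable" so that autopkgtest treats exit status 77 as a skip.

```python
import os

# With "Restrictions: skippable", autopkgtest reports exit status 77 as SKIP.
SKIP_EXIT_CODE = 77


def gpu_available(kfd_path="/dev/kfd"):
    """Return True if the KFD device node exists and is readable and writable."""
    return os.access(kfd_path, os.R_OK | os.W_OK)


def exit_code_for(kfd_path="/dev/kfd"):
    """Return 0 if the GPU tests can proceed, or the skip code if they cannot."""
    return 0 if gpu_available(kfd_path) else SKIP_EXIT_CODE
```

A test wrapper would call exit_code_for() first and pass a non-zero result to sys.exit() before launching anything that needs the GPU.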
I'll wrap this up with a motivating example of what Futhark is good for. I have a toy program that computes force directed graphs for https://piperka.net/map/. Basically it's an ad hoc O(n^2) n-body simulation in 2d space. I have a small C program that does the work and I implemented the core part of it as a GPU program with Futhark like this: https://gitlab.com/piperka/forcelayout/-/tree/tmp/futhark-not-yet-working Don't mind the branch name, it's working after the bugfix commit. If someone reads this in the future I may have deleted the branch but the code will either be in master or some other branch then. This was my first serious use of Futhark and moving to use it was simple enough for an experienced Haskell coder like me (not too uncommon a skill). My GPU is nothing too fancy (a W6600) and my Futhark version ran in under 10s compared to the 24s of my original CPU version (on a Ryzen 9 7900X). There's a Python interface too, which I haven't tested. I know LLMs have stolen all the hype, but I like to have this option available in Debian.
Nice results. It's great to see that this is a useful package.
Sincerely,
Cory Bloor