Re: Futhark on ROCm CI

To: Kari Pahula <kaol@debian.org>, debian-ai@lists.debian.org
Subject: Re: Futhark on ROCm CI
From: Cordell Bloor <cgmb@slerp.xyz>
Date: Thu, 7 Nov 2024 00:45:30 -0700
Message-id: <f417e66b-9372-4a5a-91ff-888ddec8bed1@slerp.xyz>
In-reply-to: <ZyiI4YeZWdK38dr8@sammakko4.piperka.net>
References: <ZyiI4YeZWdK38dr8@sammakko4.piperka.net>

Hi Kari,

On 2024-11-04 01:42, Kari Pahula wrote:

What's the exact condition for a package to be picked by the CI?  I
saw that haskell-futhark showed up on it even before I had any
autopkgtest files defined.  I'm thinking of packaging
futhark-benchmarks next and have them run as well and I'd like to know
what I'd need to do in debian/control to get things rolling.  Would a
Recommends on futhark alone do it, via some transitive magic?
futhark-benchmarks (not yet even ITP'd) would be a bunch of Futhark
source files to be placed under /usr/src/.

Christian is the expert on this, but it shouldn't be too hard to look up how it works. This is controlled by the debci-scheduler [1], using a configuration file that is stored in rocm-team-infra [2]. The debci-scheduler documentation states:

> To activate scheduling, an administrator creates a /etc/debci/scheduler.conf file [...]. It contains general configuration directives and a list of packages. This list of packages is called the "Wanted List". [....] Jobs are scheduled for all reverse dependencies of all binary packages of a triggered Wanted package.

The lookup of the reverse dependencies is implemented using python3-apt [3].
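
To make the quoted rule concrete, here is a minimal Python sketch of the selection step. It is not the debci-scheduler code and does not use python3-apt: the package index is a hand-written dictionary, every package name in it is hypothetical, and it assumes that only direct dependencies count and that the wanted package itself is handled separately. The documentation excerpt above does not settle either point.

```python
# Toy index: binary package -> its source package and direct dependencies.
# Every name below is hypothetical; the real scheduler queries the APT cache.
PACKAGES = {
    "libgpurt1": {"source": "gpu-runtime", "depends": []},
    "gpurt-utils": {"source": "gpu-runtime", "depends": ["libgpurt1"]},
    "arraylang": {"source": "arraylang", "depends": ["libgpurt1"]},
    "libfastfft0": {"source": "fastfft", "depends": ["gpurt-utils"]},
    "texttool": {"source": "texttool", "depends": []},
}


def sources_to_schedule(wanted_source, packages):
    """Return the source packages with a binary that depends on any binary
    built from wanted_source (direct dependencies only)."""
    wanted_binaries = {
        name for name, info in packages.items()
        if info["source"] == wanted_source
    }
    scheduled = set()
    for info in packages.values():
        if wanted_binaries.intersection(info["depends"]):
            scheduled.add(info["source"])
    # Assumption: the wanted package itself is handled separately, so drop it.
    scheduled.discard(wanted_source)
    return sorted(scheduled)
```

With this toy index, `sources_to_schedule("gpu-runtime", PACKAGES)` returns `['arraylang', 'fastfft']`, while a package nothing depends on, such as `texttool`, triggers no extra jobs.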

If you're curious about it, the results are at
https://ci.rocm.debian.net/packages/h/haskell-futhark/.  I'm not
worrying about them failing for now, I'll refine the tests with later
uploads.  I have at least one command line flag to try that upstream
suggested to use for them.  The important part is that the HIP tests
are succeeding on at least one architecture, like with
https://ci.rocm.debian.net/packages/h/haskell-futhark/unstable/amd64+gfx1032/39566/

Looks like Futhark's tests are good at stress testing the drivers and
HSA layer.  It has a lot of small tests that a GPU should have no
trouble with running in parallel with little memory use.  Like for
example with
https://ci.rocm.debian.net/packages/h/haskell-futhark/unstable/amd64+gfx1035/39659/
where one test got an error like "Memory access fault by GPU node-1
(Agent handle: 0x55b623418c20) on address 0x7fa60a57a000. Reason: Page
not present or supervisor privilege."

While it's possible that is an error in the ROCm libraries or driver, this appears to be an out-of-bounds write. It's the sort of error that you'd get if you wrote code that didn't check for allocation failure. A common pattern would be hipMalloc failing (e.g., due to out-of-memory), but the code not checking the return value and using the returned pointer as if the allocation succeeded.
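
The unchecked-allocation pattern can be illustrated without a GPU. The following Python sketch uses a made-up `FakeDevice` whose `malloc` returns a status and a handle in the style of a C API such as hipMalloc. The class, the status constants and their values are all hypothetical stand-ins, not part of HIP. The point is only that the caller inspects the status before touching the handle.

```python
OK = 0                 # stand-in for a success status such as hipSuccess
ERR_OUT_OF_MEMORY = 1  # stand-in for an out-of-memory status (value is arbitrary)


class FakeDevice:
    """Hypothetical device with a fixed memory budget, for illustration only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def malloc(self, size):
        """Mimic a C-style allocator: return (status, handle).

        On failure the handle is None and must not be used.
        """
        if self.used + size > self.capacity:
            return ERR_OUT_OF_MEMORY, None
        self.used += size
        return OK, bytearray(size)


def checked_malloc(device, size):
    """Allocate and fail loudly instead of handing back an unusable handle."""
    status, handle = device.malloc(size)
    if status != OK:
        raise MemoryError(f"allocation of {size} bytes failed with status {status}")
    return handle
```

Code that calls `device.malloc` directly and ignores the status would carry on with an invalid handle, which is the kind of mistake that could surface later as a memory access fault far from the failed allocation.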

The ROCm components are certainly not flawless. In fact, it's not that difficult to find ways to overwhelm them. That is especially true for hardware that's not officially supported by AMD for use with ROCm. I'd just also be on the lookout for a mistake in the error handling of the calling application.

And I had a GPU hang with
https://ci.rocm.debian.net/packages/h/haskell-futhark/unstable/amd64+gfx1011/39621/

[....]

Is there some way to define a custom timeout for the CI run?  The
gfx1011 test I linked above took 9 hours and this is embarrassing.
Even 2 hours maximum would be excessive for these under any
circumstances.

Yes. Christian recently implemented a '--timeout-test-nogprogress' option for autopkgtest, so we can stop the test after no new output has been received for a few minutes [5]. I'll enable that on my gfx1011 test system.
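
The idea behind a no-progress timeout can be sketched in a few lines of Python. This is not how autopkgtest implements its option. `run_with_idle_timeout` and its parameters are hypothetical names, and the sketch only watches the combined stdout/stderr of a single child process.

```python
import subprocess
import threading
import time


def run_with_idle_timeout(cmd, idle_seconds):
    """Run cmd and kill it if it prints nothing for idle_seconds.

    Returns (returncode, output_lines, timed_out).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]  # list so the reader thread can update it
    lines = []

    def reader():
        # Each line of output counts as progress and resets the idle clock.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > idle_seconds:
            timed_out = True
            proc.kill()
            break
        time.sleep(0.05)

    proc.wait()
    thread.join(timeout=1)
    return proc.returncode, lines, timed_out
```

Unlike a fixed wall-clock limit, this lets a slow but steadily reporting test suite run for hours, while a hung GPU job that has gone silent is stopped after `idle_seconds`.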

I need the 9 hour timeout because rocfft sometimes takes that long. The rocfft tests run slowly in podman for reasons that are not entirely clear to me. I suspect it's because rocfft does more IPC and file IO than most other libraries due to its use of HIP RTC. There's probably something in the podman security model that is introducing CPU overhead for that workload.

Currently, I have enabled three backends for Futhark's tests:
multicore (CPU only), OpenCL+POCL (CPU only) and HIP.  The CPU only
tests are valid as such but I find it doubtful how useful running them
on these machines is.  I think I could make them skippable and do so
on a ROCm CI environment.  Is there a way to detect that it's running
on one?  Simply reversing the /dev/kfd check seems wrong to me.

Interesting point. I don't think we have a good mechanism for this yet.

Any suggestions on how to locally test autopkgtest scripts?  I tried
it with an sbuild setup and that didn't have HSA available in it with
no relevant dev files defined.

I'm not sure how to do it with a local package version, but you should probably use the podman+rocm autopkgtest executor from pkg-rocm-tools.

I copied over some artifact gathering and the /dev/kfd skip test from
other HIP tests but I'm not liking this code duplication.  Could we
put it in /usr/share/rocm/autopkgtest/

Christian consolidated this functionality in rocm-test-launcher, which is a part of the pkg-rocm-tools package currently in NEW [4].
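
For readers who have not seen the kind of check being discussed, a /dev/kfd skip test could look roughly like the Python sketch below. It is not taken from rocm-test-launcher or from any existing package. The function names are hypothetical, and testing for read and write access (rather than mere existence) is an assumption. Exit code 77 is what autopkgtest interprets as "skipped" for tests that declare `Restrictions: skippable`.

```python
import os
import sys

SKIP_EXIT_CODE = 77  # treated as "skipped" when the test is marked skippable


def gpu_available(kfd_path="/dev/kfd"):
    """Return True if the KFD device node is readable and writable."""
    return os.access(kfd_path, os.R_OK | os.W_OK)


def main(kfd_path="/dev/kfd"):
    """Return the exit status the test script should finish with."""
    if not gpu_available(kfd_path):
        print(f"{kfd_path} not accessible; skipping HIP tests", file=sys.stderr)
        return SKIP_EXIT_CODE
    # The real HIP test commands would run here.
    return 0
```

A test script would end with `sys.exit(main())`. Keeping the path as a parameter makes the check easy to exercise on a machine without a GPU.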

I'll wrap this up with a motivating example of what Futhark is good
for.  I have a toy program that computes force directed graphs for
https://piperka.net/map/.  Basically it's an ad hoc O(n^2) n-body
simulation in 2d space.  I have a small C program that does the work
and I implemented the core part of it as a GPU program with Futhark
like this:
https://gitlab.com/piperka/forcelayout/-/tree/tmp/futhark-not-yet-working

Don't mind the branch name, it's working after the bugfix commit.  If
someone reads this in the future I may have deleted the branch but the
code will either be in master or some other branch then.

This was my first serious use of Futhark and moving to use it was
simple enough for an experienced Haskell coder like me (not too
uncommon a skill).  My GPU is nothing too fancy (a W6600) and my Futhark
version ran under 10s compared to the 24s of my original CPU version
(on a Ryzen 9 7900X).  There's a Python interface too I haven't
tested.  I know LLMs have stolen all the hype but I like to have
this option available in Debian.

Nice results. It's great to see that this is a useful package.

Sincerely,
Cory Bloor

[1]: https://salsa.debian.org/rocm-team/debci-scheduler
[2]: https://salsa.debian.org/rocm-team/rocm-team-infra/-/blob/71548aad5d72eb3cddba381f7e06141f298c649a/files/HOSTS/ci.rocm.debian.net/debci/scheduler.conf
[3]: https://salsa.debian.org/rocm-team/debci-scheduler/-/blob/3befedb711a428f9ed9ba5644f3b974597d0bc5c/bin/debci-scheduler#L229
[4]: https://ftp-master.debian.org/new/pkg-rocm-tools_0.8.1~exp1.html
[5]: https://salsa.debian.org/ci-team/autopkgtest/-/merge_requests/456
