Hi Christian,
including Fedora, EPEL, SUSE, and Ubuntu [1]. While Debian is not on this list, I expect that the work done for Ubuntu will be quite beneficial for the Debian effort. The packaging work for ROCm in-box on Ubuntu builds upon the packaging work done for ROCm in-box
This is a big win for open source. But egotistical as this might sound, I found this part saddening, because in the end, Debian got left out.
AMD certainly wanted to have Debian on that list, so maybe we should talk about why it wasn't there.
Ultimately, the problem was that AMD couldn't commit to having the ROCm stack on Debian in a state in which it could be recommended as an alternative AMD-packaged ROCm stack within the next year. To get it to a state in which it could be recommended to users, we'd have to get ROCm updated to the latest upstream release and package the remaining components required for PyTorch.
AMD is working towards getting ROCm to that point on Debian. The inability to commit to getting it there by a specific date is a reflection of (1) the structure of the Debian project, in which nobody but the FTP Masters can decide that something will be accepted, and (2) the fact that the Debian project is still so far away from that goal.
There are _many_ libraries that will have to go through NEW to get Debian to that point, and when folks ask me when a library will be available in a Debian release, I cannot commit to a specific time. If Debian were already close to the ultimate objective, maybe that wouldn't be a big deal, but there are a lot of reviews of indeterminate length between here and there.
With all of that said, just because AMD isn't making a commitment to our users that we'll get Debian to that point within the next year doesn't mean that we won't try. I've been spending most of my time requisitioning resources for Debian and training folks within AMD on Debian tools and processes. There has been no step back in AMD's commitment to Debian, and in fact, I strongly believe that the resources that come with this commitment will help Debian quite significantly.
First: "Ubuntu builds upon [..]" is an understatement. TTBOMK, whatever ROCm support Ubuntu has now was produced almost entirely by the initiative and the substantial work of many Debian contributors [1] over a period of three years, and with significant funding by the Project for our CI; work which then landed downstream.
This is true. I agree entirely.
I of course understand the business decision of first officially supporting Ubuntu. But in light of the above, I consider not also supporting Debian puzzlingly short-sighted. Not just because of the missed two-birds-one-stone opportunity, but because it fails to recognize the work Debian has done so far, and to leverage that for future results.
I don't think we're missing that opportunity. The actual work being done is applicable to both Debian and Ubuntu. And, I expect that once Debian has a reasonably complete and up-to-date ROCm stack, AMD will happily add it to the support list.
And we now all find ourselves in the strange situation where new packaging contributors will need to be onboarded from the Ubuntu side, whereas contributors from Debian must re-evaluate how to continue (see below).
I am also helping to onboard some AMD-hired Debian packaging contributors from Fedora. It's rather unfortunate that Mario and myself are the only two AMD developers that I know of that are anywhere close to being DDs, but that's how things went. AMD did engage with the community to try to hire DDs [2], but was unsuccessful at finding someone eligible for direct hiring (and therefore chose to hire someone that could be trained to contribute to Debian packaging). Asking AMD to hire and train a new contributor from scratch was something that a DD had suggested to me at one point, so that didn't seem like a bad decision to me (even if it was not the ideal outcome).
In any case, I don't yet know exactly what additional resources AMD will be bringing on board with this announcement. Your concerns are heard, but I think it's premature to criticize the contributors being onboarded when nobody knows who they are.
on Debian, and will feed back into Debian as well.
Following from this and the fact that ROCm will be "solved" in Ubuntu, I wonder whether there is any point in doing active development work for ROCm in Debian anymore, instead of just waiting for the work in Ubuntu to be done, and then to upstream it. And I say this without any bitterness, just with brute reason: (1) We know the packaging will happen for Ubuntu, so there is little point in expending our own limited capacities, rather than waiting for that. That doesn't just avoid duplication, it reduces the risk of conflicting development paths.
I think this would be true of any solution which involved paid contributors and volunteers. Why would you volunteer to do something that other people would be paid to do otherwise? That question would be just as valid even if Ubuntu never existed and AMD were contributing exclusively to Debian.
It makes total sense to reallocate your time elsewhere. If AMD is going to dedicate the resources to building out that base, then maybe you can spend your time on more interesting things? You can add features and design elements that AMD wasn't going to add (e.g., Christian Bayle's work on the docs has probably saved the packaged HTML documentation from removal), or you can switch your efforts to interesting higher-level libraries that depend on GPU compute (e.g., the three GSoC projects for enabling ROCm in Debian Science packages and vLLM).
That doesn't sound so bad to me.
(2) We can expect that ROCm in Ubuntu will see extensive CI testing, so again, there is little point in operating, much less expanding, our own CI.
I think it is unlikely that the Ubuntu continuous integration system will have anywhere near the breadth of hardware that the Debian ROCm CI contains. And I don't think it has public logs, either. Nevertheless, there is an important reason why I think you should continue to work on the Debian Continuous Integration system: Debian should be building a vendor-neutral system for supporting accelerator architectures (Intel, NVIDIA, AMD GPUs, NPUs, and FPGAs) on the DebCI.
And weren't you working on expanding the Debian ROCm CI system to cover both Debian and Ubuntu? If the testing on Ubuntu is sufficient to reason about Debian, then wouldn't the existing testing on Debian already be sufficient to reason about Ubuntu? I remember when rocminfo would crash on startup in Ubuntu 22.04. The exact same package version worked fine on Debian, but rocr-runtime was absolutely and completely broken on Ubuntu until I SRU'd a fix for a static initialization order fiasco.
AMD will be doing internal testing for ROCm components and key applications on both Debian and Ubuntu, but I still don't think that's a replacement for the DebCI.
Again, I realize how egotistical this might sound, and I'm sorry for that. I know this is a win for open source in general, I just wish this were more of a win for Debian, given how much we contributed to this.
I think it is an enormous win for Debian. All I can ask is that you withhold judgement until we can share more details.
Sincerely,
Cory Bloor
[2]: https://lists.debian.org/debian-ai/2024/06/msg00001.html
[1] Not to diminish the extra Ubuntu work you did to get our packages updated and synced for 24.04.