
Re: CI: Updates to the scheduler



Hey Cory,

On 2024-03-25 19:45, Cordell Bloor wrote:
> Thank you for the work you've done on the scheduler and the Debian ROCm
> CI. In particular, the podman backend you added recently has allowed us
> to add reliable gfx803, gfx1010 and gfx1035 machines, some of which have
> helped us to catch and fix serious regressions during the ROCm 5.7.1
> update (e.g. #1065410).

Glad this helps!

> On 2024-03-24 11:03, Christian Kastner wrote:
> I saw something odd after I uploaded a fix for rocprim on gfx1031 in
> rocprim. The tests were triggered by the upload of rocprim 5.7.1-2~exp2,
> but they installed librocprim-tests amd64 5.7.1-2~exp1 [3]. This doesn't
> seem to be a new phenomenon, as it seems that 5.7.1-1 was tested when
> 5.7.1-2~exp1 was uploaded last week.

The experimental suite has proven to be a bit of a pain. I've seen
temporarily inconsistent states in the archive that are technically
valid but cause problems for us.

For example, I've seen Sources indexes listing N binaries where the
Packages indexes listed only some of those N binaries. This was fixed [5]
by simply skipping those packages until all N binaries were present.
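
To illustrate the idea behind [5], here is a minimal sketch of such a
completeness check. It is not the scheduler's actual code: the function
names are made up for this example, and it ignores the fact that a
Packages index is per-architecture, so a source that builds binaries for
other architectures would never be considered complete here.

```python
def parse_stanzas(text):
    """Split a Debian control-style index into a list of field dicts."""
    stanzas = []
    for block in text.strip().split("\n\n"):
        fields = {}
        key = None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and key:
                # Continuation line: append to the previous field.
                fields[key] += " " + line.strip()
            elif ":" in line:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def complete_sources(sources_text, packages_text):
    """Return source names whose listed binaries all appear in Packages.

    Hypothetical helper: sources with missing binaries are skipped, to be
    picked up again on a later run once the indexes have caught up.
    """
    present = {p["Package"] for p in parse_stanzas(packages_text)}
    ready = []
    for src in parse_stanzas(sources_text):
        binaries = {b.strip() for b in src.get("Binary", "").split(",") if b.strip()}
        if binaries and binaries <= present:
            ready.append(src["Package"])
    return ready
```

Called with a Sources index that lists two binaries and a Packages index
that only has one of them, this returns an empty list, so nothing would
be scheduled for that source yet.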

In this particular case, apparently the Sources index contained the
~exp2 version which triggered a test, but the Packages index still
listed the ~exp1 versions. And debci & autopkgtest pin packages by
release, not by version, so when they found the older versions in
experimental, they didn't consider this an error.

I've filed [6] to track this.
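
One possible shape for such a check, as a sketch only: compare the source
version that triggered the test against the source version each binary
stanza in Packages claims to be built from, and treat any mismatch as
"index not ready yet". The function names are hypothetical, and the input
is assumed to be already-parsed stanzas (dicts of field name to value).
The only archive detail relied on is that a binary stanza carries the
source version in parentheses in its Source field when it differs from
the binary's own Version.

```python
import re


def source_version_of(pkg):
    """Return the source version a binary stanza was built from.

    Falls back to the binary's own Version when the Source field has no
    parenthesised version.
    """
    match = re.search(r"\(([^)]+)\)", pkg.get("Source", ""))
    return match.group(1) if match else pkg["Version"]


def stale_binaries(triggered_version, package_stanzas):
    """List binaries whose source version differs from the triggering one."""
    return [
        p["Package"]
        for p in package_stanzas
        if source_version_of(p) != triggered_version
    ]
```

With the rocprim case above, a trigger for 5.7.1-2~exp2 against a
librocprim-tests stanza still at 5.7.1-2~exp1 would be reported as stale
instead of being tested silently.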

Another issue is the use of http://deb.debian.org as a mirror, which
redirects to the geographically closest mirror. Mirrors aren't always in
perfect sync, so your worker in Canada could see different indexes than
my worker in Austria, or CI's in Frankfurt. I just checked and rocprim
is still on ~exp1 on my closest mirror, for example.

I need to think about this second problem some more.
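
This doesn't solve anything, but detecting the divergence is at least
easy to sketch. Assuming each worker can report the versions it sees as
a plain mapping (the mirror labels and the function name below are made
up for illustration), the packages that differ between mirrors can be
found like this:

```python
def out_of_sync(versions_by_mirror):
    """Return packages whose version is not identical on all mirrors.

    versions_by_mirror maps a mirror label to {package: version}. A
    package missing from a mirror is recorded as None for that mirror.
    """
    packages = set().union(*(v.keys() for v in versions_by_mirror.values()))
    diverged = {}
    for pkg in sorted(packages):
        seen = {
            mirror: versions.get(pkg)
            for mirror, versions in versions_by_mirror.items()
        }
        if len(set(seen.values())) > 1:
            diverged[pkg] = seen
    return diverged
```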

> I also notice that the rocthrust tests were triggered despite that they
> do not depend on any rocprim package at run-time [4]. I suppose the
> scheduler needs to trace the dependency tree via the source packages for
> reasons you've discussed in the past, but it's a good case study to
> highlight the value of further improvements to the scheduler.

This looks like the scheduler did not honor the restriction to check
only the -tests packages: libhipcub-dev and librocthrust-dev depend on
librocprim-dev, and both hipcub and rocthrust were triggered.

Filed as [7].
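
For clarity, the intended behavior can be sketched roughly as follows.
This is an illustration, not the scheduler's code: the function name and
the data layout are assumptions, and it looks only at direct dependencies
of binaries whose name ends in -tests.

```python
def sources_to_trigger(updated_source, binaries):
    """Return sources whose -tests binaries depend on the updated source.

    binaries maps a binary package name to a dict with its "source" name
    and its set of direct "depends". Dependencies of non-tests binaries
    such as -dev packages are deliberately ignored.
    """
    updated = {
        name for name, info in binaries.items()
        if info["source"] == updated_source
    }
    triggered = set()
    for name, info in binaries.items():
        if not name.endswith("-tests"):
            continue  # the restriction: only -tests packages count
        if info["depends"] & updated:
            triggered.add(info["source"])
    return sorted(triggered)
```

If librocthrust-dev depends on librocprim-dev but librocthrust-tests does
not depend on any rocprim binary, an upload of rocprim would not trigger
rocthrust under this rule.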

Thank you for reporting the issues.

Best,
Christian

> [1]: https://ci.rocm.debian.net/packages/r/rocblas/
> [2]: https://ci.rocm.debian.net/packages/r/rocrand/
> [3]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1031/r/rocprim/10204/log.gz
> [4]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1031/r/rocthrust/10216/log.gz 

[5] https://salsa.debian.org/rocm-team/rocm-dev-tools/-/issues/11
[6] https://salsa.debian.org/rocm-team/rocm-dev-tools/-/issues/14
[7] https://salsa.debian.org/rocm-team/rocm-dev-tools/-/issues/15

