endeavour: gfx1101 worker disabled for now
Hi,
I had to disable the gfx1101 worker a few days ago because its GPU
frequently hung on tests. The gpuenv-auto-uptime service periodically
reboots hosts when it detects an unresponsive GPU, and this interfered
with the gfx1100 worker: a long test (e.g. hipfft) would never complete
on gfx1100 before the gfx1101 failure triggered a reboot.
Because queues are processed sequentially, once such a pathological
case was entered, the same sequence repeated after the reboot.
I'll disable scheduling gfx1101 for now, and clear out the queue at some
point.
Best,
Christian