
endeavour: gfx1101 worker disabled for now



Hi,

I had to disable the gfx1101 worker a few days ago because its GPU
frequently hung during tests. The gpuenv-auto-uptime service
periodically reboots hosts when it detects an unresponsive GPU, and this
interfered with the gfx1100 worker: a long test (e.g. hipfft) would
never complete on gfx1100 before a gfx1101 failure triggered a reboot.

Because queues are processed sequentially, once such a pathological
case was entered, the same sequence repeated after every reboot.

I'll disable scheduling gfx1101 for now, and clear out the queue at some
point.

Best,
Christian

