Hi all,

On Sun, 14 Jul 2024 09:22:32 +0200 Paul Gevers <elbrus@debian.org> wrote:
> Today I restarted the s390x host of ci.d.n because I lost access.
I have been fighting with the host for several days now, and I think I finally found the culprit. Several days ago I configured the host to do:
# panic kernel on OOM
vm.panic_on_oom=1
# reboot after 10 sec on panic
kernel.panic=10

An idea I got from h01ger last year when discussing issues on arm64 and that has been working well there [1]. (See also https://www.debuntu.org/how-to-reboot-on-oom/)
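To rule out that the settings simply never got applied on the host, one could compare the wanted values against what the running kernel reports under /proc/sys. The following is only a minimal sketch: the helper names `parse_sysctl` and `read_live` are made up for illustration, and mapping a key to a path by splitting on dots is a simplifying assumption that holds for these two keys but not for every sysctl name.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse sysctl.conf-style text into a {key: value} dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def read_live(key, proc_root="/proc/sys"):
    """Return the running kernel's value for a sysctl key, or None if unreadable."""
    # Assumption: dots in the key map one-to-one to path separators.
    path = Path(proc_root, *key.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


# The settings quoted above.
WANTED = """
# panic kernel on OOM
vm.panic_on_oom=1
# reboot after 10 sec on panic
kernel.panic=10
"""

if __name__ == "__main__":
    for key, want in parse_sysctl(WANTED).items():
        print(f"{key}: wanted {want}, live {read_live(key)}")
```

If the live values match, the configuration itself is in place and the freeze has to come from how the panic is handled afterwards.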
However, that doesn't seem to work on our s390x host, as it seems to freeze instead. Is this something known? Something I'm doing wrong (e.g. these options behaving differently on s390x)? Is this an s390x kernel bug?
Paul

PS: the package that triggers this is hisat2. If you look at its history [2] you see that the test was always Terminated (ignoring the run from 2024-07-17), I now assume by the OOM killer. I filed bug 1076524 against hisat2 to tell them they are using an insane amount of memory on s390x.
[1] See e.g. the period around February/March 2024 on https://ci.debian.net/munin/ci-worker-arm64-11/ci-worker-arm64-11/uptime.html where a lot of reboots happened automatically.
[2] https://ci.debian.net/packages/h/hisat2/testing/s390x/