Re: [s390x] lots of "User process fault: interruption code XXXX"



Hi all,

On Sun, 14 Jul 2024 09:22:32 +0200 Paul Gevers <elbrus@debian.org> wrote:
> Today I restarted the s390x host of ci.d.n because I lost access.

I have been fighting with the host for several days now, and I think I finally found the culprit. Several days ago I configured the host with the following sysctl settings:
# panic kernel on OOM
vm.panic_on_oom=1
# reboot after 10 sec on panic
kernel.panic=10

I got this idea from h01ger last year while discussing issues on arm64, where it has been working well [1]. (See also https://www.debuntu.org/how-to-reboot-on-oom/)

However, that doesn't seem to work on our s390x host: instead of rebooting, it seems to freeze. Is this a known issue? Am I doing something wrong (e.g. do these options behave differently on s390x)? Is this an s390x kernel bug?
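
One way to rule out a plain configuration mistake would be to check that the running kernel actually picked up both settings. The following is only a minimal Python sketch, not something that was run on the host: it assumes the values are exposed under /proc/sys in the usual way (dots in the key become path separators), and the function names are made up for illustration.

```python
from pathlib import Path


def parse_sysctl_conf(text):
    """Parse sysctl.conf-style text into a {key: value} dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def live_value(key, root="/proc/sys"):
    """Return the running kernel's value for a sysctl key, or None if unreadable.

    Assumption: simple keys such as vm.panic_on_oom map to
    <root>/vm/panic_on_oom. Keys containing interface names with dots
    are not handled by this sketch.
    """
    path = Path(root, *key.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def mismatches(text, root="/proc/sys"):
    """Return {key: (wanted, live)} for settings whose live value differs."""
    result = {}
    for key, wanted in parse_sysctl_conf(text).items():
        live = live_value(key, root)
        if live != wanted:
            result[key] = (wanted, live)
    return result


# The two settings quoted above.
CONF = """\
# panic kernel on OOM
vm.panic_on_oom=1
# reboot after 10 sec on panic
kernel.panic=10
"""

if __name__ == "__main__":
    for key, (wanted, live) in mismatches(CONF).items():
        print(f"{key}: wanted {wanted}, running kernel has {live}")
```

If this prints nothing on the host, both values are active and the freeze is not caused by the settings failing to apply.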

Paul
PS: the package that triggers this is hisat2. If you look at its history [2], you see that the test was always Terminated (ignoring the run from 2024-07-17), which I now assume was done by the OOM killer. I filed bug 1076524 against hisat2 to tell them they are using an insane amount of memory on s390x.

[1] See e.g. the period around February/March 2024 on https://ci.debian.net/munin/ci-worker-arm64-11/ci-worker-arm64-11/uptime.html where a lot of reboots happened automatically.
[2] https://ci.debian.net/packages/h/hisat2/testing/s390x/
