Re: stress-ng process termination issue
- To: debian-hurd@lists.debian.org
- Subject: Re: stress-ng process termination issue
- From: Michael Kelly <mike@weatherwax.co.uk>
- Date: Thu, 7 Aug 2025 12:09:46 +0100
- Message-id: <7f686ce7-bc09-4160-90cc-7a198f4b9e03@weatherwax.co.uk>
- In-reply-to: <f9a3a037-44e9-4e02-a2a6-5480e7254f36@weatherwax.co.uk>
- References: <eb9dda26-d63f-47ba-935d-4baa070f4584@weatherwax.co.uk> <1a34e2ee-637e-4740-9ceb-494019333e5b@weatherwax.co.uk> <89910661-b576-431c-8aa1-81c67b7b2c30@weatherwax.co.uk> <aIkq44y2XcH9LgRt@begin> <d7c62c9f-fe01-49d6-b5df-0146ae8cc389@weatherwax.co.uk> <58fd766b-d3c9-4162-910f-9f01e889e902@weatherwax.co.uk> <aIs6OCZNzpIuTJ0O@begin> <f9a3a037-44e9-4e02-a2a6-5480e7254f36@weatherwax.co.uk>
On 31/07/2025 13:29, Michael Kelly wrote:
I think that is possible and worth a try. Process termination is
currently slowed significantly by the high pageout during the test
run, which is why the SIGKILL is required at all. The stress-ng
termination signal sequence is SIGALRM x4, then SIGTERM, then SIGKILL,
with a small delay between signals. I could change that to SIGALRM
then SIGKILL and lower the interval, to see if I can get a SIGALRM
being processed whilst SIGKILL is delivered without swapping having
taken place.
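
In case it helps to make that concrete, the sketch below (plain C, not
stress-ng's actual code; the pid argument and the 10 ms interval are
placeholder values) shows the modified SIGALRM-then-SIGKILL sequence I
have in mind:

/* Sketch of the modified termination sequence: one SIGALRM followed
   quickly by SIGKILL.  Illustrative only, not stress-ng code; the
   interval is a placeholder.  */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  if (argc < 2)
    {
      fprintf (stderr, "usage: %s <pid>\n", argv[0]);
      return 1;
    }
  pid_t pid = (pid_t) atoi (argv[1]);

  /* Ask the stressor to finish via SIGALRM first ...  */
  if (kill (pid, SIGALRM) < 0)
    perror ("kill SIGALRM");

  /* ... then deliver SIGKILL after a short interval, so the KILL
     arrives while the ALRM may still be being handled and before
     any pageout has had a chance to occur.  */
  usleep (10000);   /* 10 ms; placeholder value */
  if (kill (pid, SIGKILL) < 0)
    perror ("kill SIGKILL");

  return 0;
}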
I have had no success yet in creating a separate test case that
reproduces these circumstances.
I'd like to try to find the cause of the swapping bugs, so I'll also
continue with the existing test case.
I had spent some time building a new test bed with the latest Hurd code
using rumpdisk, only to find that swapping fails very quickly with this
setup. I've not yet investigated precisely why, but that seems to me an
important area to investigate and fix. I'll look at it once I have
resolved the current problem.
I have, however, made some minor progress on the original test scenario.
It seems that it is not pageout/pagein that is causing the problem here
but rather an assertion failure occurring within ext2fs. In my test case
the assertion leads to a deadlock: mutex locking issues leave all of the
message-servicing threads within ext2fs blocked waiting on a mutex.
Because there are then no receivers available to accept
memory_object_data_request() messages from the kernel, large numbers of
kernel threads become stuck in vm_fault_continue.
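
For anyone less familiar with the failure mode, the fragment below is a
minimal user-space illustration (plain pthreads, nothing Hurd- or
ext2fs-specific) of the pattern: one thread takes a lock and stalls,
standing in for the asserting thread, and every remaining service
thread then blocks on the same lock, leaving nothing free to accept new
requests:

/* Illustrative pattern only, not ext2fs code: one thread holds a mutex
   and never releases it (standing in for the thread stuck in the
   assertion), so every other service thread blocks in
   pthread_mutex_lock() and nothing is left to receive new messages.  */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
asserting_thread (void *arg)
{
  pthread_mutex_lock (&node_lock);
  fprintf (stderr, "holder: took lock, now stalled (simulated assert)\n");
  pause ();                     /* never unlocks */
  return NULL;
}

static void *
service_thread (void *arg)
{
  /* Each "message servicing" thread needs the same lock to make
     progress, so they all queue up behind the stalled holder.  */
  pthread_mutex_lock (&node_lock);
  fprintf (stderr, "service thread %ld: got lock (never reached)\n",
           (long) (intptr_t) arg);
  pthread_mutex_unlock (&node_lock);
  return NULL;
}

int
main (void)
{
  pthread_t holder, workers[4];

  pthread_create (&holder, NULL, asserting_thread, NULL);
  sleep (1);                    /* let the holder win the lock first */

  for (long i = 0; i < 4; i++)
    pthread_create (&workers[i], NULL, service_thread,
                    (void *) (intptr_t) i);

  /* From here on all four workers are blocked in pthread_mutex_lock();
     in the real case that means no receiver for
     memory_object_data_request() and kernel threads piling up in
     vm_fault_continue, so this program deliberately never exits.  */
  for (long i = 0; i < 4; i++)
    pthread_join (workers[i], NULL);
  return 0;
}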
Anyway, the assertion in question occurs within
glibc-2.41/sysdeps/mach/hurd/mig-reply.c:__mig_dealloc_reply_port().
Adding some debug to this code has shown that the thread-local storage
held a different port from the one expected by the message header,
although both were non-zero. I would think it more likely that the
thread-local storage is invalid rather than the message header, but that
remains to be seen.
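
Roughly speaking, the debug amounts to the check sketched below. This is
illustrative only, not the real mig-reply.c and not my exact patch;
__get_tls_reply_port() is just a placeholder for however the per-thread
reply port is read from thread-local storage:

/* Illustrative only: not the real mig-reply.c.  The point is the
   mismatch being reported: the port held in thread-local storage and
   the port named in the message header differ, yet both are non-zero.  */
#include <mach/port.h>
#include <stdio.h>

extern mach_port_t __get_tls_reply_port (void);  /* placeholder */

static void
check_reply_port (mach_port_t header_port)
{
  mach_port_t tls_port = __get_tls_reply_port ();

  if (tls_port != header_port
      && tls_port != MACH_PORT_NULL
      && header_port != MACH_PORT_NULL)
    fprintf (stderr, "reply port mismatch: tls=%u header=%u\n",
             (unsigned) tls_port, (unsigned) header_port);
}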
I am posting this message mainly to ask whether anyone is aware of
issues in the thread-local storage code, whether anything similar has
arisen before, or indeed has any other immediate thoughts as to where
the root of the problem might be. Having looked at the 'tls' code, I can
see that any attempt to trace/debug this might be very difficult.
Regards,
Mike.