Re: stress-ng process termination issue
- To: debian-hurd@lists.debian.org
- Subject: Re: stress-ng process termination issue
- From: Michael Kelly <mike@weatherwax.co.uk>
- Date: Thu, 7 Aug 2025 12:09:46 +0100
- Message-id: <7f686ce7-bc09-4160-90cc-7a198f4b9e03@weatherwax.co.uk>
- In-reply-to: <f9a3a037-44e9-4e02-a2a6-5480e7254f36@weatherwax.co.uk>
- References: <eb9dda26-d63f-47ba-935d-4baa070f4584@weatherwax.co.uk> <1a34e2ee-637e-4740-9ceb-494019333e5b@weatherwax.co.uk> <89910661-b576-431c-8aa1-81c67b7b2c30@weatherwax.co.uk> <aIkq44y2XcH9LgRt@begin> <d7c62c9f-fe01-49d6-b5df-0146ae8cc389@weatherwax.co.uk> <58fd766b-d3c9-4162-910f-9f01e889e902@weatherwax.co.uk> <aIs6OCZNzpIuTJ0O@begin> <f9a3a037-44e9-4e02-a2a6-5480e7254f36@weatherwax.co.uk>
On 31/07/2025 13:29, Michael Kelly wrote:
I think that is possible and worth a try. Process termination is
currently slowed significantly by the high pageout during the test
run, which is why the SIGKILL is required at all. The stress-ng
termination signal sequence is SIGALRM x4, then SIGTERM, then SIGKILL,
with a small delay between signals. I could change that to SIGALRM
then SIGKILL and lower the interval, to see if I can get a SIGALRM
being processed whilst SIGKILL is delivered without swapping having
taken place.
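
In case it helps to make that concrete, the sketch below (plain C, not
stress-ng's actual code; the pid argument and the 10 ms interval are
placeholder values) shows the modified SIGALRM-then-SIGKILL sequence I
have in mind:

/* Sketch of the modified termination sequence: one SIGALRM followed
   quickly by SIGKILL.  Illustrative only, not stress-ng code; the
   interval is a placeholder.  */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  if (argc < 2)
    {
      fprintf (stderr, "usage: %s <pid>\n", argv[0]);
      return 1;
    }
  pid_t pid = (pid_t) atoi (argv[1]);

  /* Ask the stressor to finish via SIGALRM first ...  */
  if (kill (pid, SIGALRM) < 0)
    perror ("kill SIGALRM");

  /* ... then deliver SIGKILL after a short interval, so the KILL
     arrives while the ALRM may still be being handled and before
     any pageout has had a chance to occur.  */
  usleep (10000);   /* 10 ms; placeholder value */
  if (kill (pid, SIGKILL) < 0)
    perror ("kill SIGKILL");

  return 0;
}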
I have had no success yet in creating a separate test case that
reproduces these circumstances.
I'd like to try to find the cause of the swapping bugs, so I'll also
continue with the existing test case.
I had spent some time building a new test bed with the latest Hurd code
using rumpdisk, only to find that swapping fails very quickly with this
setup. I've not yet investigated precisely why, but that seems to me an
important area to investigate and fix. I'll look at it once I have
resolved the current problem.
I have, however, made some minor progress on the original test scenario.
It seems that it is not pageout/pagein that is causing the problem here
but rather an assertion failure occurring within ext2fs. In my test case
the assertion leads to a deadlock: mutex locking issues leave all of the
message-servicing threads within ext2fs blocked waiting on a mutex.
Because there are then no receivers available to accept
memory_object_data_request() messages from the kernel, large numbers of
kernel threads become stuck in vm_fault_continue.
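
For anyone less familiar with the failure mode, the fragment below is a
minimal user-space illustration (plain pthreads, nothing Hurd- or
ext2fs-specific) of the pattern: one thread takes a lock and stalls,
standing in for the asserting thread, and every remaining service
thread then blocks on the same lock, leaving nothing free to accept new
requests:

/* Illustrative pattern only, not ext2fs code: one thread holds a mutex
   and never releases it (standing in for the thread stuck in the
   assertion), so every other service thread blocks in
   pthread_mutex_lock() and nothing is left to receive new messages.  */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
asserting_thread (void *arg)
{
  pthread_mutex_lock (&node_lock);
  fprintf (stderr, "holder: took lock, now stalled (simulated assert)\n");
  pause ();                     /* never unlocks */
  return NULL;
}

static void *
service_thread (void *arg)
{
  /* Each "message servicing" thread needs the same lock to make
     progress, so they all queue up behind the stalled holder.  */
  pthread_mutex_lock (&node_lock);
  fprintf (stderr, "service thread %ld: got lock (never reached)\n",
           (long) (intptr_t) arg);
  pthread_mutex_unlock (&node_lock);
  return NULL;
}

int
main (void)
{
  pthread_t holder, workers[4];

  pthread_create (&holder, NULL, asserting_thread, NULL);
  sleep (1);                    /* let the holder win the lock first */

  for (long i = 0; i < 4; i++)
    pthread_create (&workers[i], NULL, service_thread,
                    (void *) (intptr_t) i);

  /* From here on all four workers are blocked in pthread_mutex_lock();
     in the real case that means no receiver for
     memory_object_data_request() and kernel threads piling up in
     vm_fault_continue, so this program deliberately never exits.  */
  for (long i = 0; i < 4; i++)
    pthread_join (workers[i], NULL);
  return 0;
}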
Anyway, the assertion in question occurs within
glibc-2.41/sysdeps/mach/hurd/mig-reply.c:__mig_dealloc_reply_port().
Adding some debug to this code has shown that the thread-local storage
held a different port from the one expected by the message header,
although both were non-zero. I would think it more likely that the
thread-local storage is invalid rather than the message header, but that
remains to be seen.
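
Roughly speaking, the debug amounts to the check sketched below. This is
illustrative only, not the real mig-reply.c and not my exact patch;
__get_tls_reply_port() is just a placeholder for however the per-thread
reply port is read from thread-local storage:

/* Illustrative only: not the real mig-reply.c.  The point is the
   mismatch being reported: the port held in thread-local storage and
   the port named in the message header differ, yet both are non-zero.  */
#include <mach/port.h>
#include <stdio.h>

extern mach_port_t __get_tls_reply_port (void);  /* placeholder */

static void
check_reply_port (mach_port_t header_port)
{
  mach_port_t tls_port = __get_tls_reply_port ();

  if (tls_port != header_port
      && tls_port != MACH_PORT_NULL
      && header_port != MACH_PORT_NULL)
    fprintf (stderr, "reply port mismatch: tls=%u header=%u\n",
             (unsigned) tls_port, (unsigned) header_port);
}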
I am posting this message mainly to ask whether anyone is aware of
issues in the thread-local storage code, whether anything similar has
arisen before, or indeed has any other immediate thoughts as to where
the root of the problem might be. Having looked at the 'tls' code, I can
see that any attempt to trace/debug this might be very difficult.
Regards,
Mike.