
Re: stress-ng process termination issue



On 31/07/2025 13:29, Michael Kelly wrote:

I think that is possible and worth a try. Process termination is currently slowed significantly by the heavy pageout during the test run, which is why the SIGKILL is required at all. The stress-ng termination signal sequence is SIGALRM x4, then SIGTERM, then SIGKILL, with a small interval between signals. I could change that to SIGALRM then SIGKILL and lower the interval, to see whether I can get a SIGALRM processed while SIGKILL is delivered before any swapping has taken place.
I have had no success yet creating a separate test case that reproduces these circumstances.
I'd like to try to find the cause of the swapping bugs, so I'll also continue with the existing test case.

I had spent some time building a new test bed with the latest Hurd code using rumpdisk, only to find that swapping fails very quickly with this setup. I've not yet investigated precisely why, but that seems to me an important area to investigate and fix. I'll look at it once I have resolved the current problem.

I have had some minor progress, however, on the original test scenario. It seems that it is not pageout/pagein causing the problem here, but rather an assertion occurring within ext2fs. In my test case the assertion actually deadlocks, with mutex locking issues causing all message-servicing threads within ext2fs to block awaiting a mutex. Because no receivers are then available to accept memory_object_data_request() messages from the kernel, many kernel threads become stuck in vm_fault_continue.

Anyway, the assertion in question occurs within glibc-2.41/sysdeps/mach/hurd/mig-reply.c:__mig_dealloc_reply_port(). Adding some debug to this code has shown that the thread-local storage held a different port from the one expected by the message header, but that both were non-zero. I would think it more likely that the thread-local storage is invalid than the message header, but that remains to be seen.

I am posting this message mainly to see whether anyone is aware of issues in the thread-local storage code, whether anything similar has arisen before, or indeed whether anyone has other immediate thoughts on where the root of the problem might lie. Having looked at the 'tls' code, I can see that any attempt to trace/debug this might be very difficult.

Regards,

Mike.
