stress-ng process termination issue
Hi All,
I've been experimenting with stress-ng for some time to stress test my
Hurd virtual machine. This has already exposed a few problems, but here
is another. Sorry for the long explanation, but it may be necessary to
make sense of the problem. The scenario under test goes something like this:
1) Top level supervisory process 'stress-ng' begins execution
2) It forks N times, one per stressor under test (in my case 64 times).
Call these processes 'stressor'.
3) The particular tests I am running are stress-vm and stress-mmap. In
these tests each stressor process forks again so that the test can be
supervised and restarted should it run out of resources. Call these
processes 'worker'.
4) Each stressor sets a timeout using alarm() and then waits for the
worker to terminate by calling waitpid().
5) The stressor's SIGALRM handler sets a variable that the worker tests
occasionally. If the worker tests that variable promptly, it exits
normally. If it does not, the stressor sends a series of signals:
SIGALRM (four times), then SIGTERM, and finally SIGKILL, with a short
time gap between them.
The test scenario I set up uses all of the VM's real memory and a
certain portion of swap. Consequently, when the timeout expires many of
the processes are paged out and do not respond quickly, which means that
many workers receive all six signals. Occasionally, one of the stressor
processes gets stuck in this while loop in task_terminate() ($task60.0):
while (!queue_empty(list)) {
        thread = (thread_t) queue_first(list);  /* thread is 0xf60f9170
                                                   and is within the
                                                   worker process */
        ......
        thread_force_terminate(thread);
        ......
}
thread_force_terminate(thread) calls thread_halt(thread, TRUE), which
in this instance does very little: the thread is already halted, so it
simply increments the thread's suspend_count (currently standing at
0x64c0fc8e!). The thread is never removed from the list, so it is
processed again on every iteration of the loop.
The thread 0xf60f9170 is in $task61 (the worker) and is the main thread
which does all the stress testing. Examining its state suggests it is
already halted with a state of 0x112 (TH_SUSP|TH_HALTED|TH_SWAPPED).
All stack traces are attached and are annotated with extra context.
I'm trying to make sense of the thread code, but as it's rather complex
I thought it might save time to ask whether anyone has any input. In
particular, what do I need to look at or consider to determine how the
state ended up this way? Better yet, someone might immediately see the
cause of the problem. I have a virtual machine snapshot of this moment
saved, so I can easily provide any additional information required.
There is a second thread ($task60.1) in the stressor process which is
also looping, but I think it is just stuck waiting for the
task_terminate() to complete. (This second thread is processing a
secondary timeout set up by the stressor using alarm(1), but I don't
think that is necessarily relevant.)
None of the threads in $task61 appear to be active based on their 'last
updated' time reported by the kernel debugger.
Any ideas?
db> show task $task60
TASK THREADS
60 (stress-ng(6880)) (f64256d8): 3 threads:
0 (e95092e8) ..S..F.
1 (e9521a18) R......
2 (f60f92e0) ..SO...(thread_bootstrap_return)
db> trace /tu $task60.0
switch_context(e95092e8,0,f597acf0,c1039b46,0)+0xa6
thread_invoke(e95092e8,0,f597acf0,c103a625)+0xcf
thread_block(0,c103bc60,8,293,eb3668a8)+0x40
task_terminate(f6425600==$task61,1aec,f132bfd8,c1032acd,d002a640)+0x155
syscall_task_terminate(9,d002a640,f597ce58,f132bec0,f132bef8)+0x1c
>>>>> user space <<<<<
syscall_task_terminate 0x1c8d67c(0x1c8d8bc(9,55c,1adf678,1c9c359,1f34808)
kill_pid 0x2e4be(0,1f34c5c,1cbda30,1c,5)
0x1f33fb0()
db> trace /tu $task60.1
switch_context(e9521a18,0,e95092e8,c1039b46,c112e650)+0xa6
thread_invoke(e9521a18,0,e95092e8,c103a625)+0xcf
thread_block(0,1,84,7)+0x40
thread_sleep(e9509300,0,1,c103d4e7,10)+0x1d
thread_dowait(e95092e8,0,1,c103da6d)+0x73
thread_halt(e95092e8,0,ed239f20,c1002358)+0x24b
thread_abort(e95092e8,e9521a18,ed239f20,c10037d5,c1125140)+0x27
_Xthread_abort(d0751010,f9ef9010,ed239f40,c10037f9,d28e9dd8)+0x2b
ipc_kobject_server(d0751000,24,0,1000)+0x93
mach_msg_trap(2735a70,3,18,20,8)+0x8ac
>>>>> user space <<<<<
mach_msg_trap 0x1c8d4ac(0x1c8dc60(2735a70,3,18,20,8)
thread_abort 0x1eefa3f(24,4b,2735ad8,1200,20)
abort_thread 0x1c9f01c(0,0,0,0,0)
post_signal 0x1ca2b67(0,2735e1c,1c8da0b,28,1ee56e8)
0x1ca42d1(1f34008,e,2735e7c,22,12)
_S_msg_sig_post 0x1ca45cf(15,22,12,e=SIGALRM,fffffffe)
0x1f28e67(2736f50,2735f40,0,1000,1f29189)
_S_msg_server 0x1f291ea(2736f50,2735f40,1c8dc3b,2736f50,1000)
0x1cbd11a(2736f50,2735f40,0,0,0)
0x1c8e178(1cbd0d0,1000,15,0,0)
0x1c8e2a4(1cbd0d0,1000,15,0,1c6de04)
0x1cbd180(
db> trace /tu $task60.2
Continuation thread_bootstrap_return
>>>>> user space <<<<<
mach_msg_trap 0x1c8d4ac(0x1c8dc60(29d2fc0,502,0,20,4)
timer_thread 0x1d725e4(
The threads in $task61 appear to be inactive, with just one in kernel space:
db> show task $task61
TASK THREADS
61 ((stress-ng(6880))) (f6425600): 3 threads:
0 (f60f9170) ..SO.F.(thread_bootstrap_return)
1 (f60f9a10) ..S....
2 (f5a71170) ..SO...(thread_bootstrap_return)
db> trace /tu $task61.0
Continuation thread_bootstrap_return
>>>>> user space <<<<<
stress_mincore_touch_pages_slow 0x2be20(7ba,0,0,12,831)
0x12()
db> trace /tu $task61.1
switch_context(f60f9a10,0,e9521a18,c1039b46,f6665328)+0xa6
thread_invoke(f60f9a10,0,e9521a18,c103a625)+0xcf
thread_block(0,1,f1389e80,f9f2cffc)+0x40
thread_sleep(f60f9188,0,1,c103d4e7,f5a71170)+0x1d
thread_dowait(f60f9170,0,1,c103da6d)+0x73
thread_halt(f60f9170,0,f1389f20,c1002358)+0x24b
thread_abort(f60f9170,f60f9a10,f1389f20,c10037d5,c1125140)+0x27
_Xthread_abort(e0835010,f9f2b010,f1389f40,c10037f9,f2ae4288)+0x2b
ipc_kobject_server(e0835000,24,0,1000)+0x93
mach_msg_trap(2735a70,3,18,20,8)+0x8ac
>>>>> user space <<<<<
mach_msg_trap 0x1c8d4ac(0x1c8dc60(2735a70,3,18,20,8)
0x1eefa3f(24,4b,2735ad8,1200,20)
abort_thread 0x1c9f01c(0,0,0,0,0)
post_signal 0x1ca2b67(0,2735e1c,1c8da0b,28,1ee56e8)
_hurd_internal_post_signal 0x1ca42d1(1f34008,e,2735e7c,6,12)
_S_msg_sig_post 0x1ca45cf(15,6,12,e,fffffffe)
_Xmsg_sig_post 0x1f28e67(2736f50,2735f40,0,1000,1f29189)
_S_msg_server 0x1f291ea(2736f50,2735f40,1c8dc3b,2736f50,1000)
msgport_server 0x1cbd11a(2736f50,2735f40,0,0,0)
mach_msg_server_timeout 0x1c8e178(1cbd0d0,1000,15,0,0)
mach_msg_server 0x1c8e2a4(1cbd0d0,1000,15,0,1c6de04)
_hurd_msgport_receive 0x1cbd180(
db> trace /tu $task61.2
Continuation thread_bootstrap_return
>>>>> user space <<<<<
mach_msg_trap 0x1c8d4ac(0x1c8dc60(4146fc0,502,0,20,4)
timer_thread 0x1d725e4(