Hi All,
I've been using stress-ng for some time to stress test my Hurd
virtual machine. This has already exposed a few problems, and here
is another. Sorry for the long explanation, but it is probably needed
to make sense of the problem. The scenario under test goes something
like this:
1) Top level supervisory process 'stress-ng' begins execution
2) It forks N times, one per stressor under test (in my case 64
times). Call these processes 'stressor'.
3) The particular tests I am running are stress-vm and stress-mmap. In
these tests, each stressor process forks again so that the child can
be supervised and the test restarted should it run out of resources.
Call these child processes 'worker'.
4) Each stressor sets a timeout using alarm() and then waits for the
worker to terminate by calling waitpid().
5) The stressor's SIGALRM handler sets a variable that the worker
tests occasionally. If the worker tests that variable soon enough, it
exits normally. If it does not, the stressor sends a series of
signals: SIGALRM (4 times), then SIGTERM, then finally SIGKILL, with a
short time gap between each.
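
Steps 4 and 5 can be sketched roughly as follows. This is a minimal
illustration of the escalation pattern described above, not stress-ng's
actual code: the helper names (spawn_stubborn_worker, reaped_within,
escalate) and the polling interval are hypothetical, and the "stubborn"
worker simply blocks SIGALRM and SIGTERM to stand in for a paged-out
worker that never gets round to testing the flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical worker that never reacts to SIGALRM or SIGTERM.
 * The signals are blocked before fork() so the child inherits the mask. */
static pid_t spawn_stubborn_worker(void)
{
    sigset_t block, old;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigaddset(&block, SIGTERM);
    sigprocmask(SIG_BLOCK, &block, &old);

    pid_t pid = fork();
    if (pid == 0) {
        for (;;)
            pause();            /* only SIGKILL can end this */
    }

    sigprocmask(SIG_SETMASK, &old, NULL);   /* parent restores its mask */
    return pid;
}

/* Poll for the worker's exit for roughly gap_ms milliseconds. */
static bool reaped_within(pid_t pid, int *status, int gap_ms)
{
    for (int waited = 0; waited <= gap_ms; waited += 10) {
        if (waitpid(pid, status, WNOHANG) == pid)
            return true;
        struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
        nanosleep(&ts, NULL);
    }
    return false;
}

/* Send SIGALRM x4, SIGTERM, SIGKILL with a gap between each, stopping
 * as soon as the worker has been reaped. Returns the signals sent. */
static int escalate(pid_t worker, int *status, int gap_ms)
{
    static const int sequence[] = {
        SIGALRM, SIGALRM, SIGALRM, SIGALRM, SIGTERM, SIGKILL
    };
    int sent = 0;

    for (size_t i = 0; i < sizeof sequence / sizeof sequence[0]; i++) {
        if (reaped_within(worker, status, gap_ms))
            return sent;
        kill(worker, sequence[i]);
        sent++;
    }

    /* SIGKILL has been sent; wait for the worker to actually go away. */
    while (waitpid(worker, status, 0) < 0 && errno == EINTR)
        ;
    return sent;
}
```

Calling escalate(spawn_stubborn_worker(), &status, 20) should walk
through the whole sequence, which corresponds to the "worker receives
all 6 signals" case discussed in this report.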
The test scenario I set up uses all of the VM's real memory and a
certain portion of swap. Consequently, when the timeout expires, many
of the processes are paged out and do not respond quickly, which means
that many workers receive all 6 signals. Occasionally, one of the
stressor processes gets stuck in this while loop in
task_terminate ($task60.0):
    while (!queue_empty(list)) {
        /* thread is 0xf60f9170 and is within the worker process */
        thread = (thread_t) queue_first(list);
        ......
        thread_force_terminate(thread);
        ......
    }
thread_force_terminate(thread) calls thread_halt(thread, TRUE), which
in this instance does very little: the thread is already halted, so it
simply increments the thread's suspend_count (currently standing at
0x64c0fc8e!). The thread is not removed from the list, so the loop
processes it again and again.
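
To make the failure mode concrete, here is a toy model of that loop.
It is only a sketch: the toy_* types and functions are hypothetical
stand-ins, not the gnumach structures, and they model nothing beyond
the behaviour described above (the head of the list only has its
suspend count bumped and is never unlinked). An iteration cap is added
so the example terminates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the gnumach thread/task structures. */
struct toy_thread {
    bool halted;
    unsigned int suspend_count;
    struct toy_thread *next;
};

struct toy_task {
    struct toy_thread *thread_list;
};

/* Models only the case described: the thread ends up halted with its
 * suspend count incremented, and nothing removes it from the list. */
static void toy_force_terminate(struct toy_thread *t)
{
    t->suspend_count++;
    t->halted = true;
}

/* Same shape as the loop in question, with a cap so the demo stops.
 * Returns the number of iterations executed. */
static unsigned long toy_task_terminate(struct toy_task *task,
                                        unsigned long max_iterations)
{
    unsigned long n = 0;

    while (task->thread_list != NULL && n < max_iterations) {
        struct toy_thread *t = task->thread_list;   /* always the same head */
        toy_force_terminate(t);
        n++;
    }
    return n;
}
```

In this model the loop can only end if something else unlinks the
thread from the list, so the suspend count grows by one per pass, which
matches the very large suspend_count seen in the debugger.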
The thread 0xf60f9170 is in $task61 (the worker) and is the main
thread, which does all the stress testing. Its state is 0x112
(TH_SUSP|TH_HALTED|TH_SWAPPED), which suggests it is already halted.
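
For anyone who wants to check the decoding of 0x112, a small helper
along these lines does it. The bit values in the table are an
assumption based on the TH_* definitions in gnumach's kern/thread.h and
should be checked against the tree in use; the three flags named above
are at least consistent with 0x112 = 0x100 | 0x10 | 0x02. The function
name decode_thread_state is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    unsigned int bit;
    const char *name;
};

/* Assumed values; verify against kern/thread.h in your gnumach tree. */
static const struct flag_name th_flags[] = {
    { 0x0001, "TH_WAIT" },
    { 0x0002, "TH_SUSP" },
    { 0x0004, "TH_RUN" },
    { 0x0008, "TH_UNINT" },
    { 0x0010, "TH_HALTED" },
    { 0x0080, "TH_IDLE" },
    { 0x0100, "TH_SWAPPED" },
    { 0x0200, "TH_SW_COMING_IN" },
};

/* Write the set flags as "A|B|C" into buf (len must be > 0).
 * Bits not in the table are appended as a single hex value. */
static char *decode_thread_state(unsigned int state, char *buf, size_t len)
{
    size_t used = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof th_flags / sizeof th_flags[0]; i++) {
        if (!(state & th_flags[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "|" : "", th_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return buf;         /* out of space: return what fits */
        used += (size_t)n;
        state &= ~th_flags[i].bit;
    }
    if (state != 0)
        snprintf(buf + used, len - used, "%s0x%x", used ? "|" : "", state);
    return buf;
}
```

With a 128-byte buffer, decode_thread_state(0x112, buf, sizeof buf)
yields the same three flag names as quoted above.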
All stack traces are attached and annotated with extra context.
I'm trying to make sense of the thread code, but as it's rather
complex I thought I might save time by asking whether anyone has any
input. In particular, what do I need to look at or consider to
determine why the state has ended up this way? Better yet, someone
might immediately see the cause of the problem. I have a virtual
machine snapshot of this moment saved, so I can easily provide any
additional information required.
There is a second thread ($task60.1) in the stressor process which is
also looping, but I think it is just stuck waiting for
task_terminate() to complete. (This second thread is processing a
secondary timeout set up by the stressor using alarm(1), but I don't
think that is necessarily relevant.)
None of the threads in $task61 appear to be active based on their
'last updated' time reported by the kernel debugger.
Any ideas?