Deadlock during page-in and thread_suspend

To: debian-hurd@lists.debian.org
Subject: Deadlock during page-in and thread_suspend
From: Michael Kelly <audax@allealle.uk>
Date: Tue, 23 Sep 2025 15:26:45 +0100
Message-id: <e9cac123-6c7f-4f89-9e19-194c986b363c@allealle.uk>

For reference, I have been running the following stress test for quite some time now:

# stress-ng -t 2m --vm 32 --vm-bytes 750M --mmap 32 --mmap-bytes 750M --page-in

This is run on a QEMU virtual machine with 2GB of RAM and so consequently exercises page-in and page-out quite heavily. I am now running the test on 64-bit Hurd.

I have only 2 regular cases remaining where the system enters a deadlock-like state: this one, and one involving rumplib repeatedly reporting disk timeout errors. Anyway, this one relates to process termination again and can be summarised thus:

1) I've described the process architecture of stress-ng before but the relevant part to this test is that stress-ng runs, forks a child which forks another (worker) child. After the 2m timeout a signal is sent from a parent stress-ng to the worker to trigger it to complete processing and terminate.

2) The worker process does all the major processing which involves lots of pageout and pagein.

3) thread0 in the worker generates a page fault. This causes page-in involving a (top) vm_object which also has a shadow object. A mapping is made between the top object/offset and a fictitious page to block other threads from attempting the same page-in until thread0 has handled the page fault. thread0 then traverses the object chain to the shadow and makes the memory_object_data_request on the shadow object/offset and blocks itself until the reply has arrived and been processed.

4) A signal is received by the process and is handled by (say) thread1. As per normal signal handling, this results in thread0 being suspended by thread1 via the system call to thread_suspend(). It can be immediately suspended because thread0 is in TH_WAIT state and is interruptible (TH_UNINT not set).

5) After thread1 has suspended thread0 it trips a page fault itself which actually requires the same page that was being paged-in by thread0. thread1 now blocks indefinitely and cannot proceed until the original page-in completes, which of course it cannot as thread0 is suspended. thread0 will only be resumed by thread1, and thread1 cannot continue because of the state managed by thread0.

I have some confidence that the above sequence is broadly what is happening but it's difficult to be certain. I've got to this stage by adding extra state to data structures rather than the otherwise huge volume of debug logging which normally alters the timing to the point of masking the problem anyway. In any case, I think that the scenario described above is possible and provides a good match against the evidence that I do have.

I have some very vague ideas for solutions but before discussing those it would be helpful to have my analysis scrutinised for obvious error.


Cheers,

Mike.
