Re: Deadlock during page-in and thread_suspend
Hello,
Michael Kelly, on Tue 23 Sep 2025 15:26:45 +0100, wrote:
> 3) thread0 in the worker generates a page fault. This causes page-in
> involving a (top) vm_object which also has a shadow object. A mapping is
> made between the top object/offset and a fictitious page to block other
> threads from attempting the same page-in until thread0 has handled the page
> fault. thread0 then traverses the object chain to the shadow and makes the
> memory_object_data_request on the shadow object/offset and blocks itself
> until the reply has arrived and been processed.
>
> 4) A signal is received by the process and is handled by (say) thread1. As
> per normal signal handling, this results in thread0 being suspended by
> thread1 via the system call to thread_suspend(). It can be immediately
> suspended because thread0 is in TH_WAIT state and is interruptible (TH_UNINT
> not set).
Ah, so the suspension also halts any in-kernel activity of thread0,
which means it can no longer do what it promised (the page-in)?
> 5) After thread1 has suspended thread0 it trips a page fault itself which
> actually requires the same page that was being paged-in by thread0. thread1
> now blocks indefinitely and cannot proceed until the original page-in
> completes which of course it cannot as thread0 is suspended. thread0 will
> only be resumed by thread1 and thread1 cannot continue because of the state
> managed by thread0.
>
> I have some confidence that the above sequence is broadly what is happening
> but it's difficult to be certain. I've got to this stage by adding extra
> state to data structures rather than the otherwise huge volume of debug
> logging which normally alters the timing to the point of masking the problem
> anyway. In any case, I think that the scenario described above is possible
> and provides a good match against the evidence that I do have.
>
> I have some very vague ideas for solutions but before discussing those it
> would be helpful to have my analysis scrutinised for obvious error.
The scenario looks plausible to me indeed.
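
To make the cycle in steps 3 to 5 explicit, here is a minimal sketch in C
of the wait-for relationship described above. It is only a toy model under
stated assumptions: the struct, its fields and the helper functions are
hypothetical names invented for illustration, not GNU Mach code, and it
assumes the pager reply itself would arrive if thread0 were allowed to run.

```c
#include <stdbool.h>

/* Hypothetical model of the scenario; none of these names are GNU Mach APIs. */
#define NO_THREAD (-1)

struct model_thread {
    bool suspended;      /* stopped via thread_suspend() */
    int  resumer;        /* the only thread that would resume it, or NO_THREAD */
    int  page_in_owner;  /* thread whose in-flight page-in it waits on, or NO_THREAD */
};

/* Which thread must act before t can make progress? */
static int blocker_of(const struct model_thread *t)
{
    if (t->suspended)
        return t->resumer;       /* needs thread_resume() from its resumer */
    return t->page_in_owner;     /* needs the page-in owner to finish */
}

/* True if 'start' is part of a cycle in the wait-for chain. */
static bool in_wait_cycle(const struct model_thread *threads, int n, int start)
{
    int cur = start;
    for (int step = 0; step < n; step++) {
        cur = blocker_of(&threads[cur]);
        if (cur == NO_THREAD)
            return false;        /* chain ends: someone can still run */
        if (cur == start)
            return true;         /* came back to where we began */
    }
    return false;
}

/* Steps 3-5: thread0 owns the page-in, thread1 faults on the same page. */
static bool scenario_deadlocks(bool thread0_suspended)
{
    struct model_thread threads[2] = {
        /* thread0: owns the page-in, resumed only by thread1 */
        { thread0_suspended, 1, NO_THREAD },
        /* thread1: running, but blocked on thread0's page-in */
        { false, NO_THREAD, 0 },
    };
    return in_wait_cycle(threads, 2, 1);
}
```

In this model scenario_deadlocks(true) reports a cycle (thread1 waits for
thread0's page-in, thread0 waits for thread1's resume), while
scenario_deadlocks(false) does not, since an unsuspended thread0 can
complete the page-in and unblock thread1.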
Samuel