> Hello,
>
> Michael Kelly, on Tue 23 Sep 2025 15:26:45 +0100, wrote:
> > 3) thread0 in the worker generates a page fault. This causes page-in involving a (top) vm_object which also has a shadow object. A mapping is made between the top object/offset and a fictitious page to block other threads from attempting the same page-in until thread0 has handled the page fault. thread0 then traverses the object chain to the shadow and makes the memory_object_data_request on the shadow object/offset and blocks itself until the reply has arrived and been processed.
> >
> > 4) A signal is received by the process and is handled by (say) thread1. As per normal signal handling, this results in thread0 being suspended by thread1 via the system call to thread_suspend(). It can be immediately suspended because thread0 is in TH_WAIT state and is interruptible (TH_UNINT not set).
>
> Ah, the suspension also holds any kernel activity of thread0, so it won't be able to do what it promised? (paging in)

Exactly so. It also means that no other thread will be able to map the required page even after the memory_object_data_supply response has supplied the page itself (which does happen).

> > 5) After thread1 has suspended thread0 it trips a page fault itself which actually requires the same page that was being paged-in by thread0. thread1 now blocks indefinitely and cannot proceed until the original page-in completes, which of course it cannot as thread0 is suspended. thread0 will only be resumed by thread1 and thread1 cannot continue because of the state managed by thread0.
> >
> > I have some confidence that the above sequence is broadly what is happening but it's difficult to be certain. I've got to this stage by adding extra state to data structures rather than the otherwise huge volume of debug logging which normally alters the timing to the point of masking the problem anyway. In any case, I think that the scenario described above is possible and provides a good match against the evidence that I do have.
> >
> > I have some very vague ideas for solutions but before discussing those it would be helpful to have my analysis scrutinised for obvious error.

I've additional confidence that this is indeed what is happening
after further scrutiny of more recent tests.
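
To make the cycle in steps 3) to 5) concrete, here is a minimal sketch in C of the wait-for relationships as I understand them. It is only a model and not gnumach code: struct model_thread, blocked_on(), deadlocked() and scenario_steps_3_to_5() are names invented for illustration, and the pager reply is assumed to always arrive, so waiting on the pager is not modelled as a blocking dependency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not gnumach code. */
enum { NTHREADS = 2, NONE = -1 };

struct model_thread {
    bool suspended;    /* thread_suspend() has been applied to it */
    int  resumed_by;   /* thread expected to resume it, or NONE */
    int  page_wait_on; /* thread whose page-in it is waiting for, or NONE */
};

/* Return the thread that `t` cannot make progress without, or NONE. */
int blocked_on(const struct model_thread *th, int t)
{
    assert(t >= 0 && t < NTHREADS);
    if (th[t].suspended)
        return th[t].resumed_by;
    return th[t].page_wait_on;
}

/* Follow the wait-for chain from `start`; returning to `start` is a cycle. */
bool deadlocked(const struct model_thread *th, int start)
{
    int t = blocked_on(th, start);
    for (int steps = 0; t != NONE && steps < NTHREADS; steps++) {
        if (t == start)
            return true;
        t = blocked_on(th, t);
    }
    return false;
}

/* State after step 5): thread0 suspended by thread1 while owning the
 * page-in, thread1 faulting on the same page. */
void scenario_steps_3_to_5(struct model_thread *th)
{
    th[0] = (struct model_thread){ true,  1,    NONE };
    th[1] = (struct model_thread){ false, NONE, 0    };
}
```

In this model the state after step 5) contains a cycle from either thread, whereas the same state with thread0 not suspended does not, which matches the observation that the hang needs the signal to arrive while the page-in is outstanding.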
I don't yet have a solution for this problem. The vm_fault_page() implementation documents this restriction in a comment:

 *	2) To prevent another thread from racing us down the
 *	   shadow chain and entering a new page in the top
 *	   object before we do, we must keep a busy page in
 *	   the top object while following the shadow chain.

This is what prevents others from completing page-in of a page that actually becomes available after thread0 is suspended. I'm quite apprehensive about how hard it might be to re-implement safely without that restriction. It seems to me that there are many inter-dependencies across areas of code throughout gnumach that are difficult to find without very thorough knowledge of the code. That in itself makes it hard for newcomers to contribute.
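
As a rough illustration of why the busy page keeps other threads out even once the data has arrived, here is a heavily simplified model in C. It is a sketch under my own assumptions and not the real vm_fault_page() logic: the two-slot struct chain and the functions fault_begin(), data_supplied() and fault_finish() are hypothetical, and the model ignores locking, copy-on-write and everything else the real code handles.

```c
#include <assert.h>

/* Hypothetical model of one page slot in each object of a two-level chain. */
enum slot_state { SLOT_EMPTY, SLOT_BUSY_PLACEHOLDER, SLOT_RESIDENT };

struct chain {
    enum slot_state top;    /* slot in the top object */
    enum slot_state shadow; /* slot in the shadow object */
};

enum fault_action { FAULT_WAIT, FAULT_REQUEST_DATA, FAULT_MAP };

/* A thread enters the fault path for this offset. */
enum fault_action fault_begin(struct chain *c)
{
    if (c->top == SLOT_BUSY_PLACEHOLDER)
        return FAULT_WAIT;              /* another thread owns the page-in */
    if (c->top == SLOT_RESIDENT)
        return FAULT_MAP;
    c->top = SLOT_BUSY_PLACEHOLDER;     /* restriction 2): block racers */
    if (c->shadow == SLOT_RESIDENT) {
        c->top = SLOT_EMPTY;            /* nothing to wait for, release */
        return FAULT_MAP;
    }
    return FAULT_REQUEST_DATA;          /* placeholder stays in place */
}

/* The pager reply only fills the shadow object's slot. */
void data_supplied(struct chain *c)
{
    c->shadow = SLOT_RESIDENT;
}

/* Only the thread that owns the page-in releases the placeholder. */
void fault_finish(struct chain *c)
{
    assert(c->top == SLOT_BUSY_PLACEHOLDER);
    c->top = SLOT_EMPTY;
}
```

In this model a second call to fault_begin() keeps returning FAULT_WAIT after data_supplied() has run, and only returns FAULT_MAP once the owning thread has called fault_finish(). If the owner is suspended before that point, nobody else gets past the placeholder.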
A method that might require less functional change would be to somehow transfer the responsibility for completing the page-in to another thread, although I cannot see how that could be managed efficiently.
Basically, I'm almost at a standstill on this one and could benefit from a nudge in the right direction.
All the best,
Mike.