
libports and interrupted RPCs



I continue to use the same stress-ng test case, and the only system 'freeze' that I now see regularly (on non-rumpdisk systems) is triggered during the processing of system call interruptions. This arises as the stress-ng processes near termination and receive signals. The test case generates a lot of interaction between stress-ng and ext2fs, and many of the processes have an outstanding RPC at the time the signal arrives.

In such cases, the signal handling can send an 'interrupt_operation' RPC to the same port to which stress-ng sent its original RPC request, and the ext2fs server starts a new thread to process this additional request. ext2fs then calls hurd_thread_cancel() for each thread that is processing an RPC received on that port. But that in turn can find active RPCs sent to other servers (e.g. storeio), which are also interrupted with an 'interrupt_operation'.
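
For reference, the cancellation loop has roughly the following shape (paraphrased from memory of libports rather than quoted, so names and details may differ slightly):

  /* Rough sketch of the existing pattern, not the literal source.  */
  #include <hurd/ports.h>
  #include <hurd/signal.h>   /* hurd_thread_cancel */
  #include <pthread.h>

  void
  ports_interrupt_rpcs (void *portstruct)
  {
    struct port_info *pi = portstruct;
    struct rpc_info *rpc;

    pthread_mutex_lock (&_ports_lock);

    for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
      /* hurd_thread_cancel() can itself end up sending interrupt_operation
         to other servers (storeio, ...) while _ports_lock is still held.  */
      hurd_thread_cancel (rpc->thread);

    pthread_mutex_unlock (&_ports_lock);
  }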

This sequence of hurd_thread_cancel() calls all occurs whilst a single process-wide mutex is held locked (see libports/interrupt-rpcs.c). The same lock is also required to begin or end other RPCs on other ports, so those RPCs must wait until the initial interrupt_operation completes. That takes comparatively long, and sometimes so many threads throughout the system end up waiting, for so long, that a kind of system-wide deadlock occurs: they are all waiting for service (mostly page-in from external memory objects) from the same task (ext2fs).
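
To illustrate the contention: every incoming RPC on every port in the task passes through something like the following before it can be serviced (again only paraphrased, not the literal begin-rpc.c):

  /* Beginning (and ending) any RPC in the task needs the same global
     _ports_lock that the cancellation loop above holds for its whole
     duration.  */
  error_t
  ports_begin_rpc (void *portstruct, mach_port_seqno_t seqno,
                   struct rpc_info *info)
  {
    struct port_info *pi = portstruct;

    pthread_mutex_lock (&_ports_lock);  /* waits behind ports_interrupt_rpcs */
    /* ... link INFO onto pi->current_rpcs, check port/bucket flags ... */
    pthread_mutex_unlock (&_ports_lock);
    return 0;
  }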

I think that the calls to hurd_thread_cancel() need to be moved outside of the mutex-locked regions. Doing that potentially changes the timing of events compared to the existing code, but there can be no certainty about when an interrupt_operation might arrive anyway. For example, the RPC it should be interrupting might already have completed by the time the interrupt_operation RPC is received. I don't think that any synchronisation is required to preserve the current processing order. The main change required is to add new locking logic to safely coordinate access to the rpc_info structures after the lock has been released and reacquired.
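
One possible shape for that change (only a sketch of the idea, not my actual patch): take a snapshot of the thread ports while the lock is held, then drop the lock before cancelling. The thread_t values are plain Mach names copied by value, so the snapshot itself does not depend on the rpc_info structures staying around; what still needs care is any bookkeeping on the rpc_info list once the lock is retaken, which I've omitted here.

  /* Sketch only: do the cancels outside the global lock.  The rpc_info
     bookkeeping that must be redone under the lock afterwards is omitted.  */
  #include <hurd/ports.h>
  #include <hurd/signal.h>
  #include <pthread.h>
  #include <stdlib.h>

  void
  ports_interrupt_rpcs (void *portstruct)
  {
    struct port_info *pi = portstruct;
    struct rpc_info *rpc;
    thread_t *threads;
    size_t n = 0, i;

    pthread_mutex_lock (&_ports_lock);

    /* Count the outstanding RPCs on this port, then copy their thread
       names into a local array while the list is still protected.  */
    for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
      n++;
    threads = malloc (n * sizeof *threads);
    if (threads)
      {
        n = 0;
        for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
          threads[n++] = rpc->thread;
      }

    pthread_mutex_unlock (&_ports_lock);

    if (!threads)
      return;

    /* hurd_thread_cancel() may now block, or trigger further
       interrupt_operations on other servers, without holding up every
       other ports_begin_rpc()/ports_end_rpc() in this task.  */
    for (i = 0; i < n; i++)
      hurd_thread_cancel (threads[i]);

    free (threads);
  }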

I have a partial solution to this which I implemented hastily; I didn't want to spend too much time before I knew whether it was effective or not. It does seem to be very effective: my test case ran without fault for over 20 hours before I stopped it (I don't think I've achieved 8 hours before). Provided that nobody screams in horror at this approach, I'll now implement the solution properly and retest. There are about 5 or 6 similar cases where all RPCs on a port are cancelled within the global lock, so they might need altering too (although they don't seem to figure in this test case).


