
libports and interrupted RPCs



I continue to use the same stress-ng test case, and the only system 'freeze' that I now see regularly (on non-rumpdisk systems) is triggered during the processing of system call interruptions. This arises as the stress-ng processes near termination and receive signals. The test case generates a lot of interaction between stress-ng and ext2fs, and many of the processes have an outstanding RPC at the time the signal arrives.

In such cases, the signal handling can send an 'interrupt_operation' RPC to the same port to which stress-ng sent its original RPC request, and the ext2fs server starts a new thread to process this additional request. ext2fs then calls hurd_thread_cancel() for each thread that is processing an RPC received on that port. But that in turn can find active RPCs sent to other servers (e.g. storeio), which are also interrupted with an 'interrupt_operation'.
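
For reference, the cancellation loop has roughly the following shape (paraphrased from memory of libports rather than quoted, so names and details may differ slightly):

  /* Rough sketch of the existing pattern, not the literal source.  */
  #include <hurd/ports.h>
  #include <hurd/signal.h>   /* hurd_thread_cancel */
  #include <pthread.h>

  void
  ports_interrupt_rpcs (void *portstruct)
  {
    struct port_info *pi = portstruct;
    struct rpc_info *rpc;

    pthread_mutex_lock (&_ports_lock);

    for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
      /* hurd_thread_cancel() can itself end up sending interrupt_operation
         to other servers (storeio, ...) while _ports_lock is still held.  */
      hurd_thread_cancel (rpc->thread);

    pthread_mutex_unlock (&_ports_lock);
  }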

This sequence of hurd_thread_cancel() calls all occurs whilst a single process-wide mutex is held locked (see libports/interrupt-rpcs.c). The same lock is also required to begin or end other RPCs on other ports, so those RPCs must wait until the initial interrupt_operation completes. That takes comparatively long, and sometimes so many threads throughout the system end up waiting, for so long, that a kind of system-wide deadlock occurs: they are all waiting for service (mostly page-in from external memory objects) from the same task (ext2fs).
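
To illustrate the contention: every incoming RPC on every port in the task passes through something like the following before it can be serviced (again only paraphrased, not the literal begin-rpc.c):

  /* Beginning (and ending) any RPC in the task needs the same global
     _ports_lock that the cancellation loop above holds for its whole
     duration.  */
  error_t
  ports_begin_rpc (void *portstruct, mach_port_seqno_t seqno,
                   struct rpc_info *info)
  {
    struct port_info *pi = portstruct;

    pthread_mutex_lock (&_ports_lock);  /* waits behind ports_interrupt_rpcs */
    /* ... link INFO onto pi->current_rpcs, check port/bucket flags ... */
    pthread_mutex_unlock (&_ports_lock);
    return 0;
  }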

I think that the calls to hurd_thread_cancel() need to be moved outside of the mutex-locked regions. Doing that potentially changes the timing of events compared to the existing code, but there can be no certainty about when an interrupt_operation might arrive anyway. For example, the RPC it should be interrupting might already have completed by the time the interrupt_operation RPC is received. I don't think that any synchronisation is required to preserve the current processing order. The main change required is to add new locking logic to safely coordinate access to the rpc_info structures after the lock has been released and reacquired.
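
One possible shape for that change (only a sketch of the idea, not my actual patch): take a snapshot of the thread ports while the lock is held, then drop the lock before cancelling. The thread_t values are plain Mach names copied by value, so the snapshot itself does not depend on the rpc_info structures staying around; what still needs care is any bookkeeping on the rpc_info list once the lock is retaken, which I've omitted here.

  /* Sketch only: do the cancels outside the global lock.  The rpc_info
     bookkeeping that must be redone under the lock afterwards is omitted.  */
  #include <hurd/ports.h>
  #include <hurd/signal.h>
  #include <pthread.h>
  #include <stdlib.h>

  void
  ports_interrupt_rpcs (void *portstruct)
  {
    struct port_info *pi = portstruct;
    struct rpc_info *rpc;
    thread_t *threads;
    size_t n = 0, i;

    pthread_mutex_lock (&_ports_lock);

    /* Count the outstanding RPCs on this port, then copy their thread
       names into a local array while the list is still protected.  */
    for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
      n++;
    threads = malloc (n * sizeof *threads);
    if (threads)
      {
        n = 0;
        for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
          threads[n++] = rpc->thread;
      }

    pthread_mutex_unlock (&_ports_lock);

    if (!threads)
      return;

    /* hurd_thread_cancel() may now block, or trigger further
       interrupt_operations on other servers, without holding up every
       other ports_begin_rpc()/ports_end_rpc() in this task.  */
    for (i = 0; i < n; i++)
      hurd_thread_cancel (threads[i]);

    free (threads);
  }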

I have a partial solution to this which I implemented hastily; I didn't want to spend too much time before I knew whether it was effective or not. It does seem to be very effective: my test case ran without fault for over 20 hours before I stopped it (I don't think I've achieved 8 hours before). Provided that nobody screams in horror at this approach, I'll now implement the solution properly and retest. There are about 5 or 6 similar cases where all RPCs on a port are cancelled within the global lock, so they might need altering too (although they don't seem to figure in this test case).


