
Re: libports and interrupted RPCs



On 04/09/2025 00:04, Samuel Thibault wrote:
> Michael Kelly, on Mon, 01 Sep 2025 23:07:44 +0100, wrote:
>> You mean the _ports_lock mutex?
>
> Yes.
>
> Note that this mutex protects the current_rpcs list. Going through the
> list without the mutex would be unsafe. Another way would be to record
> the thread ports in a local array, and call hurd_thread_cancel() in a
> loop after releasing the mutex. But then the threads might still die
> in-between. We could add a reference to keep the port allocated, but
> hurd_thread_cancel would then allocate again a signal state for dead
> threads...

My solution hopefully satisfies these concerns.

I added a 'cancelling' state to rpc_info, which 'ports_end_rpc()' tests and blocks on until 'cancelling' is false. That prevents the thread that originated the RPC from ending it (and freeing its rpc_info) while the RPC is being cancelled by another thread. I also added a 'cancellor' field holding the id of the thread initiating the cancellation.

Then ports_interrupt_rpcs() can:

1) Iterate current_rpcs with the lock held and set 'cancellor' of each RPC that doesn't already have one to the cancelling thread's own port. This supports the cases where:

a) Other RPCs are initiated during times when the lock is released.

b) Other threads also call ports_interrupt_rpcs() whilst the lock is released.

2) Iterate over current_rpcs again, looking for any RPC whose 'cancellor' matches this thread, and mark that RPC as 'cancelling'. Release the lock for the call to hurd_thread_cancel(), reacquiring it after the call. The RPC's 'rpc_info' is still valid on return because 'cancelling' stops 'ports_end_rpc()' from freeing it.

3) Mark the RPC as no longer being cancelled, wake any thread waiting in 'ports_end_rpc()', and rescan as in 2) until there are no more RPCs left to be cancelled by this thread.

This solution appears to improve my specific test case, and I have not seen any further deadlocks caused by this scenario. The thread originating the RPC is blocked until the cancellation is complete; however, other RPCs can be started and completed during that time. It does rely on the code calling 'ports_begin_rpc()' and 'ports_end_rpc()', but that seems to be the expected usage. It also adds about 25 lines of code which I'd be reluctant to duplicate in the 6 other similar iterations around hurd_thread_cancel(); I'd want to abstract this cancellation behaviour for reuse in those cases if it were practical.
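
The part I'd be factoring out is the unlock/cancel/relock dance around hurd_thread_cancel(). Purely as an illustration (the helper name is mine; the body is just the same steps as in the attached interrupt-rpcs.c change), it might look like:

  /* Cancel the thread running RPC, keeping its rpc_info alive across
     the unlock.  Must be called with _ports_lock held; the lock is
     dropped around hurd_thread_cancel() and reacquired.  */
  static void
  _ports_cancel_rpc (struct rpc_info *rpc)
  {
    /* ports_end_rpc() blocks while 'cancelling' is set, so RPC cannot
       be freed under us while the lock is released.  */
    rpc->cancelling = TRUE;

    pthread_mutex_unlock (&_ports_lock);
    hurd_thread_cancel (rpc->thread);
    pthread_mutex_lock (&_ports_lock);

    _ports_record_interruption (rpc);
    rpc->cancellor = MACH_PORT_NULL;
    rpc->cancelling = FALSE;
    pthread_cond_broadcast (&_ports_block);
  }

ports_interrupt_rpcs() and the other hurd_thread_cancel() call sites could then share it.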

I've attached the code changes for perusal.

>> 1) One thread (thread: 35) handling an interrupt_operation request. This
>> shows it making a secondary interrupt_operation RPC to a storeio task. The
>> port in use has a msgcount of 5 preventing immediate delivery of this
>> message.
>
> Ah! interrupt_operation calls pile up... So indeed ports_interrupt_rpcs
> takes some time.
>
> But this is actually useless, one is enough for a given thread.

> I wonder if in glibc's hurd_thread_cancel, we could just add an
>
>   if (!ss->cancel)
>
> condition on the lines from ss->cancel = 1; to calling the cancel hook.
> That way, if we try to cancel the same thread several times, we'll just
> suspend/resume it several times, and not call interrupt_operation on the
> server several times.

I'll have to think about this. Signal handling is very complex, or at least it seems so to me!
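
If I've read the suggestion correctly, the guard would wrap the existing cancellation work roughly like this (just a sketch of the shape, not the actual glibc source; the suspend/resume of the target thread stays outside the guard):

  if (! ss->cancel)
    {
      ss->cancel = 1;
      /* ... the existing lines up to and including the cancel-hook
         call, i.e. the part that ends up sending interrupt_operation
         to the server ... */
    }

so a second hurd_thread_cancel() on an already-cancelled thread would only pay the suspend/resume cost.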

I did wonder why all RPCs were being cancelled when the signal is delivered (to stress-ng in this case) rather than just the one which will be used for the signal handler. For example, say a process has 4 RPC threads in use, a signal is received, and 1 of these threads is chosen to receive it. The signal thread suspends the chosen thread, calls thread_abort(), and then often calls 'interrupt_operation' on the remote port. As a consequence, the remote server calls 'hurd_thread_cancel()' on the threads handling the other 3 RPCs in progress. In this particular case the original process (stress-ng) is soon terminated anyway, so it doesn't really matter, but is it strictly necessary to cancel the other RPCs? If 'interrupt_operation' were to somehow include the identity of the original RPC call (seqno), then the remote port could be more selective about RPC cancellation.

I don't really have a wide enough knowledge of the system to express the above with confidence, so please consider it 'extremely speculative'!
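
To make that concrete, though, what I had in mind is something like the following. It is entirely hypothetical: it assumes rpc_info recorded the RPC's seqno at 'ports_begin_rpc()' time and that interrupt_operation (or a variant) passed the seqno it wants interrupted. The 'seqno' field and the function itself are invented, and _ports_cancel_rpc is the hypothetical helper sketched earlier.

  /* Hypothetical: cancel only the RPC on this port that was received
     with sequence number SEQNO, instead of every RPC in progress.  */
  void
  ports_interrupt_rpc_by_seqno (void *portstruct, mach_port_seqno_t seqno)
  {
    struct port_info *pi = portstruct;
    struct rpc_info *rpc;

    pthread_mutex_lock (&_ports_lock);
    for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
      if (rpc->seqno == seqno)          /* invented field */
        {
          _ports_cancel_rpc (rpc);      /* unlock/cancel/relock dance */
          break;
        }
    pthread_mutex_unlock (&_ports_lock);
  }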

Regards,

Mike.
diff -ur hurd.git.sceen.net/libports/begin-rpc.c hurd/libports/begin-rpc.c
--- hurd.git.sceen.net/libports/begin-rpc.c	2025-07-31 20:36:05.942918581 +0100
+++ hurd/libports/begin-rpc.c	2025-08-31 13:40:08.861325208 +0100
@@ -91,6 +91,8 @@
   
   /* Record that that an RPC is in progress */
   info->thread = hurd_thread_self ();
+  info->cancellor = MACH_PORT_NULL;
+  info->cancelling = FALSE;
   info->next = pi->current_rpcs;
   info->notifies = 0;
   if (pi->current_rpcs)
diff -ur hurd.git.sceen.net/libports/end-rpc.c hurd/libports/end-rpc.c
--- hurd.git.sceen.net/libports/end-rpc.c	2025-07-31 20:36:05.942918581 +0100
+++ hurd/libports/end-rpc.c	2025-08-31 13:45:12.721882570 +0100
@@ -27,6 +27,11 @@
 
   pthread_mutex_lock (&_ports_lock);
 
+  while (info->cancelling)
+    {
+      pthread_cond_wait(&_ports_block, &_ports_lock);
+    }
+
   if (info->notifies)
     _ports_remove_notified_rpc (info);
 
diff -ur hurd.git.sceen.net/libports/interrupt-rpcs.c hurd/libports/interrupt-rpcs.c
--- hurd.git.sceen.net/libports/interrupt-rpcs.c	2025-07-31 20:36:05.942918581 +0100
+++ hurd/libports/interrupt-rpcs.c	2025-09-04 07:30:43.372779057 +0100
@@ -21,6 +21,8 @@
 #include "ports.h"
 #include <hurd.h>
 
+#include <stdbool.h>
+
 void
 ports_interrupt_rpcs (void *portstruct)
 {
@@ -32,12 +34,38 @@
   
   for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
     {
-      if (rpc->thread != self)
+      if (rpc->thread != self && rpc->cancellor == MACH_PORT_NULL)
 	{
-	  hurd_thread_cancel (rpc->thread);
-	  _ports_record_interruption (rpc);
+	  rpc->cancellor = self;
+	}
+    }
+
+  bool cancelled;
+  do
+    {
+      cancelled = FALSE;
+
+      for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
+	{
+	  if (rpc->cancellor == self)
+	    {
+	      /* setting rpc->cancelling guarantees that 'rpc' won't
+		 be ended until we reset to FALSE. */
+	      rpc->cancelling = cancelled = TRUE;
+
+	      pthread_mutex_unlock (&_ports_lock);
+	      hurd_thread_cancel (rpc->thread);
+	      pthread_mutex_lock (&_ports_lock);
+
+	      _ports_record_interruption (rpc);
+	      rpc->cancellor = MACH_PORT_NULL;
+	      rpc->cancelling = FALSE;
+	      pthread_cond_broadcast(&_ports_block);
+	      break;
+	    }
 	}
     }
+  while (cancelled);
 
   pthread_mutex_unlock (&_ports_lock);
 }
diff -ur hurd.git.sceen.net/libports/ports.h hurd/libports/ports.h
--- hurd.git.sceen.net/libports/ports.h	2025-07-31 20:36:05.946918537 +0100
+++ hurd/libports/ports.h	2025-08-31 13:39:22.853889878 +0100
@@ -22,6 +22,7 @@
 #define _HURD_PORTS_
 
 #include <mach.h>
+#include <stdbool.h>
 #include <stdlib.h>
 #include <hurd.h>
 #include <hurd/ihash.h>
@@ -112,6 +113,8 @@
   struct rpc_info *next, **prevp;
   struct rpc_notify *notifies;
   struct rpc_info *interrupted_next;
+  thread_t cancellor;
+  bool     cancelling;
 };
 
 /* An rpc has requested interruption on a port notification.  */
