Re: libports and interrupted RPCs
On 04/09/2025 00:04, Samuel Thibault wrote:
Michael Kelly, le lun. 01 sept. 2025 23:07:44 +0100, a ecrit:
You mean the _ports_lock mutex?
Yes.
Note that this mutex protects the current_rpcs list. Going through the
list without the mutex would be unsafe. Another way would be to record
the thread ports in a local array, and call hurd_thread_cancel() in a
loop after releasing the mutex. But then the threads might still die
in-between. We could add a reference to keep the port allocated, but
hurd_thread_cancel would then allocate again a signal state for dead
threads...
My solution hopefully satisfies these concerns.
I added a 'cancelling' state to rpc_info which is tested in
'ports_end_rpc()' so that it will block until 'cancelling' is false.
That prevents the RPC originator thread from terminating if the RPC is
being cancelled by another thread. I also added a 'cancellor' state
which is the id of the thread initiating the cancellation.
Then ports_interrupt_rpcs() can:
1) Iterate current_rpcs with lock held and set 'cancellor' of each RPC
to its thread port. This supports the cases where:
a) Other RPCs are initiated during times when the lock is released.
b) Other threads also call ports_interrupt_rpcs() whilst the lock is
released.
2) Iterate again over current_rpcs looking for any RPC with matching
'cancellor' and mark that RPC as 'cancelling'. Release the lock for the
call to hurd_thread_cancel() reacquiring it after the call. The RPC's
'rpc_info' is still valid because of 'cancelling'.
3) Reset the RPC as no longer in cancellation and repeat from 1) until
there are no more RPCs to be cancelled by this thread.
This solution appears to improve my specific test case and I have not
seen any further deadlocks caused by this scenario. The thread
originating the RPC is blocked until the cancellation is complete
however other RPCs can be started and completed during that time. It
does rely on the code calling 'ports_begin_rpc()' and 'ports_end_rpc()'
but that seems to be the expected usage. It also adds about 25 lines of
code which I'd be reluctant to duplicate in the 6 other similar
iterations around hurd_thread_cancel(). I'd want to abstract this
cancellation behaviour for reuse in those cases if it was practical.
I've attached the code changes for perusal.
1) One thread (thread: 35) handling an interrupt_operation request. This
shows it making a secondary interrupt_operation RPC to a storeio task. The
port in use has a msgcount of 5 preventing immediate delivery of this
message.
Ah! interrupt_operation calls pile up... So indeed ports_interrupt_rpcs
takes some time.
But this is actually useless, one is enough for a given thread.
I wonder if in glibc's hurd_thread_cancel, we could just add an
if (!ss->cancel)
condition on the lines from ss->cancel = 1; to calling the cancel hook.
That way, if we try to cancel the same thread several times, we'll just
suspend/resume it several times, and not call interrupt_operation on the
server several times.
I'll have to think about this. Signal handling is very complex or at
least seems so to me!
I did wonder why all RPCs were being cancelled when the signal is
delivered (to stress-ng in this case) rather than just the one which
will be used for the signal handler. For example, say a process has 4
RPC threads in use, a signal is received, and 1 of these threads is
chosen to receive it. The signal thread suspends the chosen thread,
calls thread_abort() and often then calls 'interrupt_operation' on the
remote port. As a consequence the remote server will call
'hurd_thread_cancel()' on the other 3 RPCs in progress. In this
particular case the original process (stress-ng) is soon being
terminated, so it doesn't really matter, but is it strictly necessary to
cancel the other RPCs? If 'interrupt_operation' was to somehow include
the identity of the original RPC call (seqno) then the remote port could
be more selective about RPC cancellation.
I don't really have a wide enough knowledge of the system to express the
above with confidence so please consider it 'extremely speculative'!
Regards,
Mike.
diff -ur hurd.git.sceen.net/libports/begin-rpc.c hurd/libports/begin-rpc.c
--- hurd.git.sceen.net/libports/begin-rpc.c 2025-07-31 20:36:05.942918581 +0100
+++ hurd/libports/begin-rpc.c 2025-08-31 13:40:08.861325208 +0100
@@ -91,6 +91,8 @@
/* Record that that an RPC is in progress */
info->thread = hurd_thread_self ();
+ info->cancellor = MACH_PORT_NULL;
+ info->cancelling = FALSE;
info->next = pi->current_rpcs;
info->notifies = 0;
if (pi->current_rpcs)
diff -ur hurd.git.sceen.net/libports/end-rpc.c hurd/libports/end-rpc.c
--- hurd.git.sceen.net/libports/end-rpc.c 2025-07-31 20:36:05.942918581 +0100
+++ hurd/libports/end-rpc.c 2025-08-31 13:45:12.721882570 +0100
@@ -27,6 +27,11 @@
pthread_mutex_lock (&_ports_lock);
+ while (info->cancelling)
+ {
+ pthread_cond_wait(&_ports_block, &_ports_lock);
+ }
+
if (info->notifies)
_ports_remove_notified_rpc (info);
diff -ur hurd.git.sceen.net/libports/interrupt-rpcs.c hurd/libports/interrupt-rpcs.c
--- hurd.git.sceen.net/libports/interrupt-rpcs.c 2025-07-31 20:36:05.942918581 +0100
+++ hurd/libports/interrupt-rpcs.c 2025-09-04 07:30:43.372779057 +0100
@@ -21,6 +21,8 @@
#include "ports.h"
#include <hurd.h>
+#include <stdbool.h>
+
void
ports_interrupt_rpcs (void *portstruct)
{
@@ -32,12 +34,38 @@
for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
{
- if (rpc->thread != self)
+ if (rpc->thread != self && rpc->cancellor == MACH_PORT_NULL)
{
- hurd_thread_cancel (rpc->thread);
- _ports_record_interruption (rpc);
+ rpc->cancellor = self;
+ }
+ }
+
+ bool cancelled;
+ do
+ {
+ cancelled = FALSE;
+
+ for (rpc = pi->current_rpcs; rpc; rpc = rpc->next)
+ {
+ if (rpc->cancellor == self)
+ {
+ /* setting rpc->cancelling guarantees that 'rpc' won't
+ be ended until we reset to FALSE. */
+ rpc->cancelling = cancelled = TRUE;
+
+ pthread_mutex_unlock (&_ports_lock);
+ hurd_thread_cancel (rpc->thread);
+ pthread_mutex_lock (&_ports_lock);
+
+ _ports_record_interruption (rpc);
+ rpc->cancellor = MACH_PORT_NULL;
+ rpc->cancelling = FALSE;
+ pthread_cond_broadcast(&_ports_block);
+ break;
+ }
}
}
+ while (cancelled);
pthread_mutex_unlock (&_ports_lock);
}
diff -ur hurd.git.sceen.net/libports/ports.h hurd/libports/ports.h
--- hurd.git.sceen.net/libports/ports.h 2025-07-31 20:36:05.946918537 +0100
+++ hurd/libports/ports.h 2025-08-31 13:39:22.853889878 +0100
@@ -22,6 +22,7 @@
#define _HURD_PORTS_
#include <mach.h>
+#include <stdbool.h>
#include <stdlib.h>
#include <hurd.h>
#include <hurd/ihash.h>
@@ -112,6 +113,8 @@
struct rpc_info *next, **prevp;
struct rpc_notify *notifies;
struct rpc_info *interrupted_next;
+ thread_t cancellor;
+ bool cancelling;
};
/* An rpc has requested interruption on a port notification. */
Reply to: