Re: libports and interrupted RPCs
Hello,
Michael Kelly, on Thu 04 Sept 2025 07:43:42 +0100, wrote:
> On 04/09/2025 00:04, Samuel Thibault wrote:
> > Michael Kelly, on Mon 01 Sept 2025 23:07:44 +0100, wrote:
> > > > You mean the _ports_lock mutex?
> > > Yes.
> > Note that this mutex protects the current_rpcs list. Going through the
> > list without the mutex would be unsafe. Another way would be to record
> > the thread ports in a local array, and call hurd_thread_cancel() in a
> > loop after releasing the mutex. But then the threads might still die
> > in-between. We could add a reference to keep the port allocated, but
> > hurd_thread_cancel would then again allocate a signal state for dead
> > threads...
>
> My solution hopefully satisfies these concerns.
I was indeed wondering about a way to keep rpc_info alive for some time.
I'm a bit afraid of the added complexity.
> It does rely on the code calling 'ports_begin_rpc()' and
> 'ports_end_rpc()' but that seems to be the expected usage.
Yes, it really is expected.
> It also adds about 25 lines of code which I'd be reluctant to
> duplicate in the 6 other similar iterations around hurd_thread_cancel(). I'd
> want to abstract this cancellation behaviour for reuse in those cases if it
> were practical.
Yes, we'd clearly want to factor that out.
> I added a 'cancelling' state to rpc_info which is tested in
> 'ports_end_rpc()' so that it will block until 'cancelling' is false. That
> prevents the RPC originator thread from terminating if the RPC is being
> cancelled by another thread. I also added a 'cancellor' state which is the
> id of the thread initiating the cancellation.
>
> Then ports_interrupt_rpcs() can:
>
> 1) Iterate current_rpcs with lock held and set 'cancellor' of each RPC to
> its thread port. This supports the cases where:
>
> a) Other RPCs are initiated during times when the lock is released.
>
> b) Other threads also call ports_interrupt_rpcs() whilst the lock is
> released.
>
> 2) Iterate again over current_rpcs looking for any RPC with matching
> 'cancellor' and mark that RPC as 'cancelling'.
>
> Release the lock for the call to hurd_thread_cancel(), reacquiring
> it after the call. The RPC's 'rpc_info' is still valid because
> 'cancelling' is set.
You are not really sure what is happening with the rpc_info list while
you don't have the lock. Possibly it currently happens to be safe
because the item you are on will not move within the list, but this
looks very fragile to me. It may be better to record an array of the
rpc_info pointers that we want to cancel.
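A minimal sketch of that snapshot-then-act approach, using plain pthreads and simplified types as stand-ins for the libports internals (the names and fields here are illustrative, not the real libports API; in the real code the entries would additionally need to be pinned so they stay valid after the lock is dropped):

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the libports internals: the real
   struct rpc_info and current_rpcs list live in libports and are
   protected by _ports_lock.  */
struct rpc_info
{
  int thread;                /* stand-in for the thread_t port */
  struct rpc_info *next;
};

static pthread_mutex_t ports_lock = PTHREAD_MUTEX_INITIALIZER;
static struct rpc_info *current_rpcs;

/* Snapshot the threads to cancel while holding the lock, then call the
   cancellation function on the copy with the lock released, so the
   list is never walked while it can change under us.  Returns the
   number of threads cancelled.  */
static size_t
cancel_snapshot (void (*cancel) (int thread))
{
  size_t n = 0, i;
  struct rpc_info *r;
  int *threads;

  pthread_mutex_lock (&ports_lock);
  for (r = current_rpcs; r; r = r->next)
    n++;
  threads = malloc (n * sizeof *threads);
  for (i = 0, r = current_rpcs; r; r = r->next)
    threads[i++] = r->thread;
  pthread_mutex_unlock (&ports_lock);

  /* current_rpcs may change now, but the snapshot is stable.  */
  for (i = 0; i < n; i++)
    cancel (threads[i]);     /* hurd_thread_cancel in the real code */

  free (threads);
  return n;
}
```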
> 3) Reset the RPC as no longer in cancellation and repeat from 1) until there
> are no more RPCs to be cancelled by this thread.
We may end up in a livelock here, if some other code keeps making new
RPCs in the thread.
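For reference, the cancelling/cancellor handshake described in the steps above might look roughly like this, again with pthread stand-ins and hypothetical names rather than the real libports structures:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical sketch of the proposed rpc_info extension: 'cancellor'
   identifies the thread driving the cancellation, and 'cancelling'
   pins the structure while hurd_thread_cancel runs with _ports_lock
   released.  Names are illustrative, not the real libports API.  */
struct rpc_info
{
  bool cancelling;          /* set while a canceller works on this RPC */
  int cancellor;            /* id of the thread doing the cancelling   */
};

static pthread_mutex_t ports_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cancel_done = PTHREAD_COND_INITIALIZER;

/* The RPC originator must not tear down its rpc_info while another
   thread is cancelling it: wait for 'cancelling' to clear first.  */
void
ports_end_rpc_sketch (struct rpc_info *info)
{
  pthread_mutex_lock (&ports_lock);
  while (info->cancelling)
    pthread_cond_wait (&cancel_done, &ports_lock);
  /* ...unlink info from current_rpcs here...  */
  pthread_mutex_unlock (&ports_lock);
}

/* Steps 2 and 3 of the scheme: mark the RPC, drop the lock for the
   cancellation call, then clear the mark and wake any waiter.  */
void
cancel_one_sketch (struct rpc_info *info, void (*thread_cancel) (void))
{
  pthread_mutex_lock (&ports_lock);
  info->cancelling = true;
  pthread_mutex_unlock (&ports_lock);

  thread_cancel ();          /* hurd_thread_cancel in the real code */

  pthread_mutex_lock (&ports_lock);
  info->cancelling = false;
  pthread_cond_broadcast (&cancel_done);
  pthread_mutex_unlock (&ports_lock);
}
```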
> > > 1) One thread (thread: 35) handling an interrupt_operation request. This
> > > shows it making a secondary interrupt_operation RPC to a storeio task. The
> > > port in use has a msgcount of 5 preventing immediate delivery of this
> > > message.
> > Ah! interrupt_operation calls pile up... So indeed ports_interrupt_rpcs
> > takes some time.
> >
> > But this is actually useless, one is enough for a given thread.
> >
> > I wonder if in glibc's hurd_thread_cancel, we could just add an
> >
> > if (!ss->cancel)
> >
> > condition on the lines from ss->cancel = 1; to calling the cancel hook.
> > That way, if we try to cancel the same thread several times, we'll just
> > suspend/resume it several times, and not call interrupt_operation on the
> > server several times.
>
> I'll have to think about this. Signal handling is very complex or at least
> seems so to me!
Signal handling is the most complex thing in Unix, really :)
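The guard suggested above could be sketched as follows; the types and names are simplified stand-ins for glibc's struct hurd_sigstate and its cancel hook, not the actual glibc code:

```c
#include <stdbool.h>

/* Hedged sketch of the proposed guard in glibc's hurd_thread_cancel:
   only do the expensive cancellation work the first time, so repeated
   cancellations of one thread do not flood the server with
   interrupt_operation RPCs.  'struct sigstate' stands in for the real
   struct hurd_sigstate.  */
struct sigstate
{
  bool cancel;
};

static int hook_calls;

static void
cancel_hook (void)
{
  /* Stands in for running the cancel hook, which ends up sending
     interrupt_operation to the server.  */
  hook_calls++;
}

void
thread_cancel_sketch (struct sigstate *ss)
{
  /* With the guard, cancelling an already-cancelled thread just
     suspends/resumes it and skips the hook (and thus the extra
     interrupt_operation on the server).  */
  if (!ss->cancel)
    {
      ss->cancel = true;
      cancel_hook ();
    }
}
```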
> I did wonder why all RPCs were being cancelled when the signal is delivered
It's not all RPCs, just the ones on the port that the signaled thread
is waiting for an RPC on. That can indeed be a lot if many threads
happen to be waiting on this port. It however looks safer this way: you
never really know what kind of interlocking condition there might be in
the server between the various threads blocked on the port. For
instance, if the server were serving a shared condition variable, you
might want to make sure that everyone gets a chance to wake up, not
only the thread that is getting interrupted and might go on to do
something else.
We just want to avoid a storm of interruptions, and I believe that not
cancelling a thread which is already being cancelled gets us a long way
in that direction.
Samuel