Re: libports and interrupted RPCs
Michael Kelly, on Fri, 12 Sep 2025 19:27:00 +0100, wrote:
> On 08/09/2025 21:58, Samuel Thibault wrote:
> > Michael Kelly, on Mon, 08 Sep 2025 07:05:39 +0100, wrote:
> > > The changes I suggest do not access the list in this way after the mutex has
> > > been released. The next iteration restarts the scan from the (possibly new)
> > > head of the list.
> > Ah, so it restarts the scan from scratch on each cancellation. That gives
> > quadratic complexity asymptotically, which will hit us sooner or later.
>
> Well, yes, I must agree with you. I had assumed a very small list length, on
> the order of 10 or so, but clearly the length is limited only by memory
> resources, so the total number of iterations grows rapidly as the length
> increases. Fair point. I did address this by managing additional list
> pointers within rpc_info to maintain a separate 'cancellation list'. I then
> concluded, however, that the whole approach was flawed. For example, suppose
> two separate threads initiate an interrupt_operation on the same remote
> port. On the remote side, one thread would capture the RPCs to cancel, while
> a second thread could soon start (once _ports_lock is released), find no
> RPCs to cancel, and complete before the first group had actually been
> cancelled. That doesn't seem right and is in any case a change in behaviour.
>
> I think what is needed is a mechanism that guarantees the current sequential
> behaviour per port_info while still permitting genuine concurrency across
> different port_info structures. The latter isn't really achieved at the
> moment because of the global _ports_lock. This could possibly be a
> port_info-specific mutex, or perhaps an extension of the 'blocking' flags.
> I'm investigating those options at the moment, but it seems difficult to
> ensure a locking order that prevents deadlock.
I don't think we need a strict sequential guarantee.
Why not record the port_info* pointers in a local array? This requires
extending their lifetime in case of a concurrent termination of the RPC, but
that can be done by adding a reference counter and some port_info_ref/_unref
helpers that maintain it; ports_end_rpc can then wait for the counter to drop
to zero.
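As a minimal sketch of that reference-counting idea: the field names
interrupt_refcnt and ref_drained and the helper wait_for_cancellations are
made up for illustration, and the real struct port_info in libports of course
looks different.

#include <assert.h>
#include <pthread.h>

/* Hypothetical, much-simplified stand-in for struct port_info: only the
   fields needed to illustrate the ref/unref idea are shown.  */
struct port_info
{
  pthread_mutex_t lock;
  pthread_cond_t ref_drained;   /* signalled when interrupt_refcnt reaches 0 */
  int interrupt_refcnt;         /* references held by in-flight cancellations */
};

/* Take a reference before dropping _ports_lock, so the port_info stays
   alive while hurd_thread_cancel is called without the lock held.  */
static void
port_info_ref (struct port_info *pi)
{
  pthread_mutex_lock (&pi->lock);
  pi->interrupt_refcnt++;
  pthread_mutex_unlock (&pi->lock);
}

/* Drop the reference and wake up anyone waiting for the count to drain.  */
static void
port_info_unref (struct port_info *pi)
{
  pthread_mutex_lock (&pi->lock);
  assert (pi->interrupt_refcnt > 0);
  if (--pi->interrupt_refcnt == 0)
    pthread_cond_broadcast (&pi->ref_drained);
  pthread_mutex_unlock (&pi->lock);
}

/* What ports_end_rpc would gain: wait until no cancellation still holds a
   reference before letting the rpc_info be torn down.  */
static void
wait_for_cancellations (struct port_info *pi)
{
  pthread_mutex_lock (&pi->lock);
  while (pi->interrupt_refcnt > 0)
    pthread_cond_wait (&pi->ref_drained, &pi->lock);
  pthread_mutex_unlock (&pi->lock);
}

int
main (void)
{
  static struct port_info pi =
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
  port_info_ref (&pi);
  port_info_unref (&pi);
  wait_for_cancellations (&pi);   /* returns immediately: count is 0 */
  return 0;
}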
> > > I don't understand the suggestion about not re-cancelling a thread already
> > > in cancellation due to a signal.
> > I'm not talking only about signals, but also about interruption:
> > our issue is that ports_interrupt_rpcs calls hurd_thread_cancel
> > which cancels the thread, and for that calls _hurdsig_abort_rpcs
> > which might get stuck inside the __interrupt_operation() call. If
> > in hurd_thread_cancel we check ss->cancel and avoid calling
> > _hurdsig_abort_rpcs again, we won't call __interrupt_operation() again
> > and get stuck there.
> >
> > > That occurs within the originating client but isn't the storm of
> > > interruptions being generated on the server side?
> > On the server side there can be a cascade of interruptions too, yes, but
> > at least it wouldn't pile up cancellations.
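Roughly, the guard being discussed amounts to something like the sketch
below. The sigstate structure is reduced to the one flag that matters here,
abort_rpcs and thread_cancel_guarded are made-up stand-ins for the real
_hurdsig_abort_rpcs and hurd_thread_cancel, and the locking and thread
suspension that the real code does are omitted.

#include <stdio.h>

/* Hypothetical, much-simplified stand-in for the per-thread sigstate;
   only the cancel flag matters for this sketch.  */
struct sigstate
{
  int cancel;   /* set to 1 once the thread has been cancelled */
};

/* Stand-in for _hurdsig_abort_rpcs, which can end up calling
   __interrupt_operation() and getting stuck there.  */
static void
abort_rpcs (struct sigstate *ss)
{
  (void) ss;
  printf ("aborting RPCs (may call __interrupt_operation)\n");
}

/* Guarded cancellation: only the first cancellation of a thread goes
   through the abort-RPC path; later ones find the flag already set and
   return without interrupting the operation again.  */
static void
thread_cancel_guarded (struct sigstate *ss)
{
  if (ss->cancel != 1)
    {
      ss->cancel = 1;
      abort_rpcs (ss);
      /* ... call the cancellation hook here ... */
    }
}

int
main (void)
{
  struct sigstate ss = { 0 };
  thread_cancel_guarded (&ss);   /* first cancellation: aborts RPCs */
  thread_cancel_guarded (&ss);   /* 'needless' cancellation: skipped */
  return 0;
}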
>
> I tested your idea of not re-cancelling a thread by wrapping the code, from
> where the 'cancel' state is set to 1 to just after the call to the
> cancellation hook, in a guard of if (ss->cancel != 1). To capture a record
> of these 'needless cancellations' I dumped some debug output using
> mach_print (ext2fs doesn't seem to report on stderr). It took 3.5 hours of
> my test case before the system locked up, with the scenario involving a call
> to interrupt_operation() on ext2fs, which then calls interrupt_operation on
> another task (as reported in earlier messages). This was using released Hurd
> source code without my alterations.
>
> During the test run, there were 1825 needless cancellations (2 in term, 362
> in ext2fs, 1439 in proc, 22 in storeio).
Ok, that's not that much actually.
> I think this is an above-average 'time to failure', but I'd have to make a
> statistically significant number of runs for it to mean much. It does,
> however, seem to show that there is still a need to address the issue of
> holding _ports_lock while calling hurd_thread_cancel().
If the traces still show that the threads are stuck in
interrupt_operation, indeed.
Samuel