On 08/09/2025 21:58, Samuel Thibault wrote:
> Michael Kelly, le lun. 08 sept. 2025 07:05:39 +0100, a écrit:
>> The changes I suggest do not access the list in this way after the
>> mutex has been released. The next iteration restarts the scan from
>> the (possibly new) head of the list.
> Ah, it restarts over on each cancellation. That's really bad
> asymptotic quadratic complexity. That will hit us sooner or later.
Well, yes, I must agree with you. I had assumed a very small list length, of the order of 10 or so, but clearly the length is limited only by memory resources, so the total number of iterations grows rapidly as the list grows. Fair point.

I did in fact address this by managing additional list pointers within rpc_info to maintain a separate 'cancellation list'. However, I then concluded that the whole approach is flawed. For example, suppose two separate threads initiate an interrupt_operation on the same remote port. On the remote side, one thread would capture the RPCs to cancel, whilst a second thread could commence soon afterwards (once _ports_lock is released), find no RPCs left to cancel, and complete before the first group had actually been cancelled. That doesn't seem right and is in any case a change in behaviour.
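For reference, the two-phase shape I had in mind was roughly the following. This is only a simplified sketch, not the real libports code: rpc_info_sketch is a cut-down stand-in for struct rpc_info, and cancel_next is the hypothetical extra pointer used to build the cancellation list.

  #include <pthread.h>
  #include <mach.h>
  #include <hurd/signal.h>        /* hurd_thread_cancel */

  extern pthread_mutex_t _ports_lock;     /* the libports global lock */

  struct rpc_info_sketch
  {
    thread_t thread;
    struct rpc_info_sketch *next;         /* the port's current_rpcs chain */
    struct rpc_info_sketch *cancel_next;  /* hypothetical: snapshot list */
  };

  static void
  interrupt_rpcs_sketch (struct rpc_info_sketch *current_rpcs)
  {
    struct rpc_info_sketch *rpc, *cancel_list = NULL;

    /* Phase 1: snapshot the RPCs to cancel while holding the lock,
       so the scan is done exactly once rather than restarting.  */
    pthread_mutex_lock (&_ports_lock);
    for (rpc = current_rpcs; rpc != NULL; rpc = rpc->next)
      {
        rpc->cancel_next = cancel_list;
        cancel_list = rpc;
      }
    pthread_mutex_unlock (&_ports_lock);

    /* Phase 2: cancel without holding _ports_lock.  The race described
       above lives in this window: a second interrupter can run, find
       nothing left to capture and return before these cancellations
       have actually been delivered.  */
    for (rpc = cancel_list; rpc != NULL; rpc = rpc->cancel_next)
      hurd_thread_cancel (rpc->thread);
  }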
I think what is needed is a mechanism that guarantees the current sequential behaviour per port_info while permitting genuine concurrency across different port_info objects. We don't really have the latter at the moment because of the global _ports_lock. This could perhaps be a port_info-specific mutex, or an extension of the 'blocking' flags. I'm currently investigating those options, but it seems difficult to establish a locking order that prevents deadlock.
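To make the idea concrete, here is the rough shape I'm experimenting with. Again only a sketch: the per-port interrupt_lock member is hypothetical and does not exist in the real struct port_info, and the reference counting and error handling are elided.

  #include <pthread.h>

  extern pthread_mutex_t _ports_lock;     /* the libports global lock */

  /* Cut-down stand-in for struct port_info with a hypothetical
     per-port mutex added.  */
  struct port_info_sketch
  {
    pthread_mutex_t interrupt_lock;
    /* ... the existing port_info fields ... */
  };

  static void
  interrupt_port_rpcs_sketch (struct port_info_sketch *pi)
  {
    /* Lock order is always global then per-port, so two interrupters
       cannot deadlock against each other; the global lock is dropped
       before anything that can block.  */
    pthread_mutex_lock (&_ports_lock);
    pthread_mutex_lock (&pi->interrupt_lock);
    pthread_mutex_unlock (&_ports_lock);

    /* Cancellations for this port_info are serialized here by
       interrupt_lock, preserving the current sequential behaviour per
       port, while interruptions on other port_info objects can proceed
       concurrently.  The blocking hurd_thread_cancel calls would be
       made at this point, without _ports_lock held.  */

    pthread_mutex_unlock (&pi->interrupt_lock);
  }

The difficulty mentioned above is ensuring that no existing code path ends up taking these locks in the opposite order.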
>> I don't understand the suggestion about not re-cancelling a thread
>> already in cancellation due to a signal.
> I'm not saying only about signals, but also about interruption: our
> issue is that ports_interrupt_rpcs calls hurd_thread_cancel which
> cancels the thread, and for that calls _hurdsig_abort_rpcs which might
> get stuck inside the __interrupt_operation() call. If in
> hurd_thread_cancel we check ss->cancel and avoid calling
> _hurdsig_abort_rpcs again, we won't call __interrupt_operation() again
> and get stuck there.
>> That occurs within the originating client but isn't the storm of
>> interruptions being generated on the server side?
> On the server side there can be a cascade of interruptions too, yes,
> but at least it wouldn't pile cancellations.
I tested your idea of not re-cancelling a thread already in cancellation: in hurd_thread_cancel I wrapped the code from where the 'cancel' state is set to 1 through to just after the call to the cancellation hook in a guard of if (ss->cancel != 1). To capture a record of each such 'needless cancellation' I dumped some debug output using mach_print (ext2fs doesn't seem to report on stderr). Apart from that guard this was the released Hurd source, without my other alterations. It took 3.5 hours of my test case before the system locked up, again with the scenario in which a call to interrupt_operation() on ext2fs in turn calls interrupt_operation() on another task (as reported in earlier messages).
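In outline the change looks like this. It is a paraphrase of what I actually built, not the literal diff, and the needless_cancellations counter is just scaffolding for the debug output.

  /* Paraphrase of the experimental guard inside hurd_thread_cancel
     (glibc hurd/thread-cancel.c), after the target thread has been
     suspended and its sigstate ss locked.  Not the literal patch.  */

    if (ss->cancel != 1)
      {
        /* First cancellation of this thread: existing code, from
           setting the cancel flag through to the call of the
           cancellation hook (which is what ends up calling
           __interrupt_operation()).  */
        ss->cancel = 1;
        /* ... existing abort/hook code unchanged ... */
      }
    else
      {
        /* The thread is already in cancellation: skip re-aborting its
           RPC and just record the needless cancellation.  mach_print
           is used because ext2fs doesn't seem to report on stderr.  */
        needless_cancellations++;
        mach_print ("hurd_thread_cancel: needless cancellation\n");
      }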
During the test run there were 1825 needless cancellations: 2 in term, 362 in ext2fs, 1439 in proc and 22 in storeio.
I think this is an above-average 'time to failure', but I'd have to make a statistically relevant number of runs for that to mean much. It does, however, seem to show that there is still a need to address the issue of holding _ports_lock whilst calling hurd_thread_cancel().
Best regards, Mike.