Re: libports and interrupted RPCs
Michael Kelly, on Fri, 12 Sep 2025 19:27:00 +0100, wrote:
> On 08/09/2025 21:58, Samuel Thibault wrote:
> > Michael Kelly, on Mon, 08 Sep 2025 07:05:39 +0100, wrote:
> > > The changes I suggest do not access the list in this way after the mutex has
> > > been released. The next iteration restarts the scan from the (possibly new)
> > > head of the list.
> > Ah, so it restarts the scan from scratch on each cancellation. That gives
> > quadratic complexity asymptotically, which will hit us sooner or later.
>
> Well, yes, I must agree with you. I had assumed a very small list length, on
> the order of 10 or so, but clearly the length is limited only by memory
> resources, so the total number of iterations grows rapidly as the length
> increases. Fair point. I did address this by managing additional list
> pointers within rpc_info to maintain a separate 'cancellation list'. I then
> concluded, however, that the whole approach was flawed. For example, suppose
> two separate threads initiate an interrupt_operation on the same remote
> port. On the remote side, one thread would capture the RPCs to cancel, while
> a second thread could soon start (once _ports_lock is released), find no
> RPCs to cancel, and complete before the first group had actually been
> cancelled. That doesn't seem right and is in any case a change in behaviour.
>
> I think what is needed is a mechanism that guarantees the current sequential
> behaviour per port_info while still permitting genuine concurrency across
> different port_info structures. The latter isn't really achieved at the
> moment because of the global _ports_lock. This could possibly be a
> port_info-specific mutex, or perhaps an extension of the 'blocking' flags.
> I'm investigating those options at the moment, but it seems difficult to
> ensure a locking order that prevents deadlock.
I don't think we need a strict sequential guarantee.
Why not record the port_info* pointers in a local array? This requires
extending their lifetime in case of a concurrent termination of the RPC, but
that can be done by adding a reference counter and some port_info_ref/_unref
helpers that maintain it; ports_end_rpc can then wait for the counter to drop
to zero.
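As a minimal sketch of that reference-counting idea: the field names
interrupt_refcnt and ref_drained and the helper wait_for_cancellations are
made up for illustration, and the real struct port_info in libports of course
looks different.

#include <assert.h>
#include <pthread.h>

/* Hypothetical, much-simplified stand-in for struct port_info: only the
   fields needed to illustrate the ref/unref idea are shown.  */
struct port_info
{
  pthread_mutex_t lock;
  pthread_cond_t ref_drained;   /* signalled when interrupt_refcnt reaches 0 */
  int interrupt_refcnt;         /* references held by in-flight cancellations */
};

/* Take a reference before dropping _ports_lock, so the port_info stays
   alive while hurd_thread_cancel is called without the lock held.  */
static void
port_info_ref (struct port_info *pi)
{
  pthread_mutex_lock (&pi->lock);
  pi->interrupt_refcnt++;
  pthread_mutex_unlock (&pi->lock);
}

/* Drop the reference and wake up anyone waiting for the count to drain.  */
static void
port_info_unref (struct port_info *pi)
{
  pthread_mutex_lock (&pi->lock);
  assert (pi->interrupt_refcnt > 0);
  if (--pi->interrupt_refcnt == 0)
    pthread_cond_broadcast (&pi->ref_drained);
  pthread_mutex_unlock (&pi->lock);
}

/* What ports_end_rpc would gain: wait until no cancellation still holds a
   reference before letting the rpc_info be torn down.  */
static void
wait_for_cancellations (struct port_info *pi)
{
  pthread_mutex_lock (&pi->lock);
  while (pi->interrupt_refcnt > 0)
    pthread_cond_wait (&pi->ref_drained, &pi->lock);
  pthread_mutex_unlock (&pi->lock);
}

int
main (void)
{
  static struct port_info pi =
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
  port_info_ref (&pi);
  port_info_unref (&pi);
  wait_for_cancellations (&pi);   /* returns immediately: count is 0 */
  return 0;
}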
> > > I don't understand the suggestion about not re-cancelling a thread already
> > > in cancellation due to a signal.
> > I'm not talking only about signals, but also about interruption:
> > our issue is that ports_interrupt_rpcs calls hurd_thread_cancel
> > which cancels the thread, and for that calls _hurdsig_abort_rpcs
> > which might get stuck inside the __interrupt_operation() call. If
> > in hurd_thread_cancel we check ss->cancel and avoid calling
> > _hurdsig_abort_rpcs again, we won't call __interrupt_operation() again
> > and get stuck there.
> >
> > > That occurs within the originating client but isn't the storm of
> > > interruptions being generated on the server side?
> > On the server side there can be a cascade of interruptions too, yes, but
> > at least it wouldn't pile up cancellations.
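Roughly, the guard being discussed amounts to something like the sketch
below. The sigstate structure is reduced to the one flag that matters here,
abort_rpcs and thread_cancel_guarded are made-up stand-ins for the real
_hurdsig_abort_rpcs and hurd_thread_cancel, and the locking and thread
suspension that the real code does are omitted.

#include <stdio.h>

/* Hypothetical, much-simplified stand-in for the per-thread sigstate;
   only the cancel flag matters for this sketch.  */
struct sigstate
{
  int cancel;   /* set to 1 once the thread has been cancelled */
};

/* Stand-in for _hurdsig_abort_rpcs, which can end up calling
   __interrupt_operation() and getting stuck there.  */
static void
abort_rpcs (struct sigstate *ss)
{
  (void) ss;
  printf ("aborting RPCs (may call __interrupt_operation)\n");
}

/* Guarded cancellation: only the first cancellation of a thread goes
   through the abort-RPC path; later ones find the flag already set and
   return without interrupting the operation again.  */
static void
thread_cancel_guarded (struct sigstate *ss)
{
  if (ss->cancel != 1)
    {
      ss->cancel = 1;
      abort_rpcs (ss);
      /* ... call the cancellation hook here ... */
    }
}

int
main (void)
{
  struct sigstate ss = { 0 };
  thread_cancel_guarded (&ss);   /* first cancellation: aborts RPCs */
  thread_cancel_guarded (&ss);   /* 'needless' cancellation: skipped */
  return 0;
}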
>
> I tested your idea of not re-cancelling a thread by wrapping the code, from
> where the 'cancel' state is set to 1 to just after the call to the
> cancellation hook, in a guard of if (ss->cancel != 1). To capture a record
> of these 'needless cancellations' I dumped some debug output using
> mach_print (ext2fs doesn't seem to report on stderr). It took 3.5 hours of
> my test case before the system locked up, with the scenario involving a call
> to interrupt_operation() on ext2fs, which then calls interrupt_operation on
> another task (as reported in earlier messages). This was using released Hurd
> source code without my alterations.
>
> During the test run, there were 1825 needless cancellations (2 in term, 362
> in ext2fs, 1439 in proc, 22 in storeio).
Ok, that's not that much actually.
> I think this is an above-average 'time to failure', but I'd have to make a
> statistically significant number of runs for it to mean much. It does,
> however, seem to show that there is still a need to address the issue of
> holding _ports_lock while calling hurd_thread_cancel().
If the traces still show that the threads are stuck in
interrupt_operation, indeed.
Samuel