
Re: stress-ng process termination issue



On 07/08/2025 13:41, Samuel Thibault wrote:
> Michael Kelly wrote on Thu, 07 Aug 2025 13:07:07 +0100:
>> On 07/08/2025 12:13, Samuel Thibault wrote:
>>> TLS code is really difficult to debug indeed. But I'd doubt that's the
>>> issue, as it is largely used for many other cases, and the code is
>>> actually very trivial: just reading/writing %gs:0x38.
>>>
>>> And it's probably very easy for calls to __mig_get_reply_port() /
>>> __mig_dealloc_reply_port () to get wrong with mismatched cleanup code,
>>> e.g.
>>>
>>> port = __mig_get_reply_port()
>>> __mig_dealloc_reply_port(port);
>>> port2 = __mig_get_reply_port()
>>> __mig_dealloc_reply_port(port);
>> I do hope that you are right although I hadn't expected mismatches from
>> generated code.
> I wouldn't indeed, but there are also signal management and a couple
> explicit calls.

I now have a better understanding of how the 'assert (port == arg)' [mig-reply.c:__mig_dealloc_reply_port()] occurs. The assertion has fired within tasks associated with storeio, ext2fs and stress-ng, which illustrates that it is a general problem.

In the case of ext2fs, an example occurs when it makes the RPC call to __fsys_getroot(). The kernel call to mach_msg_trap() returns MACH_SEND_INTERRUPTED on the first attempt at sending the message. The code within _hurd_intr_rpc_mach_msg() then attempts to resend the message, but the message header has been modified by the first call into gnumach. A subsequent call to mach_msg_trap() returns EINTR and control drops into the error handling code of __fsys_getroot(), which finds an unexpected port in the message header and triggers the assertion.

The message header is modified within gnumach by the 'slow_send:' case within mach_msg_trap when ipc_mqueue_send() fails to deliver the message (presumably with MACH_SEND_INTERRUPTED). It is the call to ipc_kmsg_copyout_pseudo() that modifies the kmsg header and, ultimately, the user-space message header.
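
The sequence described above can be modelled outside Mach with a toy sketch. Everything in it is hypothetical: `toy_header`, `toy_send()`, `toy_rpc_naive()` and the port values are stand-ins, not the gnumach or glibc definitions. It only shows how a blind resend after a header rewrite leaves a reply port that a later `port == arg` style check would reject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real Mach types or return codes. */
enum toy_result { TOY_SUCCESS, TOY_SEND_INTERRUPTED };

struct toy_header {
    unsigned remote_port;  /* destination of the request */
    unsigned local_port;   /* reply port the caller allocated */
};

/* Models a send that is interrupted on the first *interrupts calls.
   On interruption the "kernel" hands the header back with the reply
   port renamed, mimicking the rewrite attributed to the copyout step. */
enum toy_result toy_send(struct toy_header *h, int *interrupts,
                         unsigned renamed_port)
{
    assert(h != NULL && interrupts != NULL);
    if (*interrupts > 0) {
        (*interrupts)--;
        h->local_port = renamed_port;
        return TOY_SEND_INTERRUPTED;
    }
    return TOY_SUCCESS;
}

/* Resends after an interruption without touching the header.
   Returns true if the header still names the caller's reply port,
   i.e. whether a `port == arg` style check would pass afterwards. */
bool toy_rpc_naive(unsigned reply_port, int interrupts)
{
    struct toy_header h = { .remote_port = 7, .local_port = reply_port };
    while (toy_send(&h, &interrupts, reply_port + 100)
           == TOY_SEND_INTERRUPTED)
        ;  /* retry with whatever the header now contains */
    return h.local_port == reply_port;
}
```

With no interruption, `toy_rpc_naive(42, 0)` reports a matching port; with a single interrupted send, `toy_rpc_naive(42, 1)` reports a mismatch.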

I haven't determined the origin of the new port number that ipc_kmsg_copyout_pseudo() inserts into the message header. It is determined from the task's IPC space but it isn't immediately obvious to me why it translates into a seemingly unrelated port number. Perhaps I've just been staring at the code too long? I can continue staring a while longer if no one can advise. If the port number translation is expected then I can just concentrate on fixing the user end of the code within _hurd_intr_rpc_mach_msg().

I can solve my test case by 'correcting' the message header within the MACH_SEND_INTERRUPTED case of _hurd_intr_rpc_mach_msg(). There are other places in this code that recognise that the header is modified by the system call and correct it before attempting to resend the message. Adopting that approach and resetting 'm->header.msgh_local_port = rcv_name' before the 'goto message' resend attempt does appear to stop the assertion from occurring. I did actually restore other components of the message header (as is done in the other cases) but some of those might not be necessary (to be investigated/tested). Is this an appropriate solution or might there be other unexpected side effects?
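
The shape of that correction can be sketched as a toy model. This is not the glibc code: `fix_header`, `fix_send()`, `fix_rpc()` and the clobber value are hypothetical, and only the idea of writing the saved `rcv_name` back into the header before each resend is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real Mach or glibc definitions. */
enum fix_result { FIX_SUCCESS, FIX_SEND_INTERRUPTED };

struct fix_header {
    unsigned remote_port;  /* destination of the request */
    unsigned local_port;   /* reply port, clobbered on interruption */
};

/* Interrupted sends overwrite local_port, mimicking the observed
   header rewrite; the value 0xdead is arbitrary. */
enum fix_result fix_send(struct fix_header *h, int *interrupts)
{
    assert(h != NULL && interrupts != NULL);
    if (*interrupts > 0) {
        (*interrupts)--;
        h->local_port = 0xdead;
        return FIX_SEND_INTERRUPTED;
    }
    return FIX_SUCCESS;
}

/* Sends with retry. rcv_name plays the role of the saved reply port
   and is written back into the header before every resend. Returns
   the local_port held by the header once the send has gone through. */
unsigned fix_rpc(unsigned rcv_name, int interrupts)
{
    struct fix_header h = { .remote_port = 7, .local_port = rcv_name };
    while (fix_send(&h, &interrupts) == FIX_SEND_INTERRUPTED) {
        h.local_port = rcv_name;  /* undo the rewrite, then resend */
    }
    return h.local_port;
}
```

In this model `fix_rpc(42, 3)` still ends with port 42 in the header after three interrupted sends. The real header has more fields than the two shown here, and which of them need restoring remains the open question raised above.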

I have also experienced a new 'lockup' involving process termination which I am investigating. It is related to task_terminate() again and I cannot yet tell if it is a consequence of my patch or whether it is simply similar.

I am still aiming for a successful 24-hour test run before making any patch suggestions, but I'd welcome any thoughts on what I have determined so far.

Regards,

Mike.


