
Re: reliable reproducer, was Re: core dump analysis



Hi Finn,

On 23/04/23 21:23, Finn Thain wrote:
On Sun, 23 Apr 2023, Michael Schmitz wrote:

On 23.04.2023 at 13:41, Michael Schmitz wrote:

Though the question remains - is this expected behaviour for programs
that do deep recursion on the stack while taking signals (and the reason
for the option to run signal handlers on an alternate stack)?

I don't understand how "deep recursion" can be used to explain this. We've
seen crashes with only 1.8 MB of stack usage.

OK, it's not really deep (though I once managed to get the test case killed by the OOM killer on my rather RAM-starved machine). But it is putting lots of frames on the stack in a short span of time while also using the stack for signal delivery.

The best reason I can think of for having a signal stack would be that it
may be better for signal delivery to fail than for the target process to
fail. But I've no idea whether the kernel makes that kind of defensive
programming possible (?)

I don't think there's any provision for signal delivery to fail - the signal handler is started from the return-to-userspace code in entry.S, and upon return from the handler, a sigreturn syscall is automatically executed to clean up the stack. As long as the handler returns, all's fine.

Not sure what happens if the process context that the handler runs in is killed by the kernel - I suppose the entire process is killed and the context removed, so the issue of parent process survival is moot. But I'm sure we can place an illegal instruction in the handler as soon as a stack overflow is spotted, get a dump and look at that.

And why does this almost always appear to happen after bus error exceptions
(frame format b)? The extra exception stack information isn't even accounted
for in the above frame end address!

Result with sa_sigaction handler:

parent usp  : 0xef969e28
handler tos : 0xef969e6c
handler stack overwrote usp!
frame end   : 0xef969e7c
frame start : 0xef969b58
handler usp : 0xef969b40
signal usp  : 0xef969e04
signal pc   : 0x80000696
signal fmtv : 0x114

parent usp  : 0xef955008
handler tos : 0xef955064
handler stack overwrote usp!
frame end   : 0xef955074
frame start : 0xef954d50
handler usp : 0xef954d38
signal usp  : 0xef954ffc
signal pc   : 0x80000680
signal fmtv : 0xb008

parent usp  : 0xef945eb8
handler tos : 0xef945f0c
handler stack overwrote usp!
frame end   : 0xef945f1c
frame start : 0xef945bf8
handler usp : 0xef945be0
signal usp  : 0xef945ea8
signal pc   : 0xc009f37a
signal fmtv : 0x80

parent usp  : 0xef933eb8
handler tos : 0xef933f0c
handler stack overwrote usp!
frame end   : 0xef933f1c
frame start : 0xef933bf8
handler usp : 0xef933be0
signal usp  : 0xef933ea8
signal pc   : 0xc009f37a
signal fmtv : 0x80

parent usp  : 0xef921edc
handler tos : 0xef9aaca4
handler stack overwrote usp!
frame end   : 0xef9aacb4
frame start : 0xef9aa990
handler usp : 0xef9aa978
signal usp  : 0xef9aac40
signal pc   : 0x80000782
signal fmtv : 0x114

Illegal instruction (core dumped)

I don't understand these results. If usp was really overwritten, the
program would have crashed early, no?

I think we're still at the point where rec() is called recursively, before any returns.

Exception right before crash was an interrupt in this case (only seen
that once in this context, though I've seen lots of those in the course
of the test runs). Frame start calculated from siginfo pointer value in
this case.

I didn't realize that you could get a crash from a signal delivered
following an interrupt. I'll try to modify the kernel such that signals
are not delivered after page faults.

Yes, that was news to me, too. I've got swap enabled and probably see a lot more disk I/O than on your machines.

Delaying signal return until the next syscall or interrupt after a page fault ought not to be too hard - just replace the 'jra ret_from_exception' by 'RESTORE_ALL' (though that would also defer rescheduling until the next interrupt). For a proper solution, replicate exit_work without a call to do_signal_return ...

Cheers,

    Michael

