Re: core dump analysis, was Re: stack smashing detected
On Mon, 3 Apr 2023, Michael Schmitz wrote:
> On 02.04.2023 at 21:31, Finn Thain wrote:
> >
> >>
> >> Maybe an interaction between (multiple?) signals and syscall
> >> return...
> >
> > When running dash from gdb in QEMU, there's only one signal (SIGCHLD)
> > and it gets handled before __wait3() returns. (Of course, the "stack
> > smashing detected" failure never shows up in QEMU.)
>
> Might be a clue that we need multiple signals to force the stack
> smashing error. And we might not get that in QEMU, due to faster
> execution when emulating on a modern processor.
>
Right -- given that the failure is intermittent on real hardware, it's not
surprising that I can't make it show up in QEMU or Aranym.
But no-one has reproduced it on Atari or Amiga hardware yet, so I guess it
could be a driver issue...
I wonder whether anyone else is actually running recent Debian/SID with
sysvinit and without a Debian initrd on a Motorola 68030 system.
> Thinking a bit more about interactions between signal delivery and
> syscall return, it turns out that we don't check for pending signals
> when returning from a syscall. That's OK on non-SMP systems, because
> we don't have another process running while we execute the syscall
> (and we _do_ run signal handling when scheduling, i.e. when wait4
> sleeps or is woken up)?
>
> Seems we can forget about that interaction then.
>
> >
> >> depends on how long we sleep in wait4, and whether a signal happens
> >> just during that time.
> >>
> >
> > I agree, there seems to be a race condition there. (And dash's
> > waitproc() seems to take pains to reap the child and handle the signal
> > in any order.)
>
> Yes, it makes sure the SIGCHLD is seen no matter in what order the
> signals are delivered ...
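For what it's worth, the usual way to get that order-independence is to
block SIGCHLD, test a flag set by the handler, and sleep in sigsuspend(),
which unblocks and waits atomically. A rough, untested sketch of that
general pattern -- not dash's actual waitproc(), the names are mine:

#include <signal.h>
#include <stddef.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;    /* the handler only sets a flag */
}

/* Reap one child; assumes on_sigchld() was installed with sigaction(). */
static pid_t wait_for_child(int *status)
{
    sigset_t block, old;
    pid_t pid;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);

    for (;;) {
        got_sigchld = 0;
        /* With SIGCHLD blocked, this can't race with the handler. */
        pid = wait3(status, WNOHANG, NULL);
        if (pid != 0)
            break;    /* reaped a child, or got an error */
        /* Atomically unblock SIGCHLD and sleep until it arrives. */
        while (!got_sigchld)
            sigsuspend(&old);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}

Either the child has already exited (wait3() reaps it straight away) or it
hasn't yet (sigsuspend() wakes us up), so neither ordering gets lost.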
>
> > I wouldn't be surprised if this race somehow makes the failure rare.
> >
> > I don't want to recompile any userland binaries at this stage, so it
> > would be nice if we could modify the kernel to keep track of exactly
> > how that race gets won and lost. Or perhaps there's an easy way to rig
> > the outcome one way or the other.
>
> A race between syscall return due to child exit and signal delivery
> seems unlikely, but maybe there is a race between syscall return due to
> a timer firing and signal delivery. Are there any timers set to
> periodically interrupt wait3?
>
I searched the source code and SIGALRM appears to be unused by dash. And
'timeout' is not a dash builtin. But that doesn't mean we don't get
multiple signals. One crashy script looks like this:
TMPFS_SIZE="$(tmpfs_size_vm "$TMPFS_SIZE")"
RUN_SIZE="$(tmpfs_size_vm "$RUN_SIZE")"
LOCK_SIZE="$(tmpfs_size_vm "$LOCK_SIZE")"
SHM_SIZE="$(tmpfs_size_vm "$SHM_SIZE")"
TMP_SIZE="$(tmpfs_size_vm "$TMP_SIZE")"
Is it possible that the SIGCHLD from the first sub-shell got delayed?
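If it would help to see how the race actually plays out on the machine that
fails, a standalone probe that mimics that fork/wait3 pattern might do:
fork a child that exits immediately, wait3() for it, and record whether the
SIGCHLD handler had already run when wait3() returned. Just a rough,
untested sketch (the names are mine, it's not dash code):

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;    /* just set a flag, like dash does */
}

int main(void)
{
    struct sigaction sa = { .sa_handler = on_sigchld };
    int i;

    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);

    for (i = 0; i < 5; i++) {    /* five cycles, like the five $(...) above */
        pid_t pid, reaped;
        int status;

        got_sigchld = 0;
        pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0)
            _exit(0);    /* child: exit immediately */

        reaped = wait3(&status, 0, NULL);
        printf("cycle %d: reaped %d, handler %s run when wait3 returned\n",
               i, (int)reaped, got_sigchld ? "had" : "had not");
    }
    return 0;
}

Running that in a tight loop on the 68030 box (and in QEMU for comparison)
might at least show whether the handler-after-return ordering ever happens.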
>
> Still no nearer to a solution - something smashes the stack near %sp,
> causes the %a3 register restore after __GI___wait4_time64 to return a
> wrong pointer to the stack canary, and triggers a stack smashing warning
> in this indirect way. But what??
>
I've no idea.
The actual corruption might offer a clue here. I believe the saved %a3 was
clobbered with the value 0xefee1068, which seems to be a pointer into some
stack frame that would have come into existence shortly after
__GI___wait4_time64 was called. That stack frame was gone by the time the
core dump was made. Was it dash's signal handler, onsig(), or some libc
subroutine called by __GI___wait4_time64(), or was it something that the
kernel put there?
Dash's SIGCHLD handler looks safe enough -- I don't see how it could
corrupt the saved registers in the __GI___wait4_time64 stack frame (it's
not like 1 was stored in the wrong place).
https://sources.debian.org/src/dash/0.5.12-2/src/trap.c/?hl=285#L285
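From memory, the shape of that handler is roughly the following (a
paraphrase, not the real code -- the trap.c link above has the exact
version); the point being that it only stores small constants into globals
at fixed addresses and never writes through a stack address:

#include <signal.h>

volatile sig_atomic_t gotsigchld;
volatile sig_atomic_t pending_sig;
char gotsig[64];    /* indexed by signal number - 1 */

void onsig(int signo)
{
    if (signo == SIGCHLD)
        gotsigchld = 1;

    gotsig[signo - 1] = 1;
    pending_sig = signo;
}

So even if it ran at the worst possible moment, it shouldn't be able to
scribble on a saved register slot in __GI___wait4_time64's frame.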