
Re: core dump analysis, was Re: stack smashing detected



On Mon, 3 Apr 2023, Michael Schmitz wrote:

> Am 02.04.2023 um 21:31 schrieb Finn Thain:
> >
> >>
> >> Maybe an interaction between (multiple?) signals and syscall 
> >> return...
> >
> > When running dash from gdb in QEMU, there's only one signal (SIGCHLD) 
> > and it gets handled before __wait3() returns. (Of course, the "stack 
> > smashing detected" failure never shows up in QEMU.)
> 
> Might be a clue that we need multiple signals to force the stack 
> smashing error. And we might not get that in QEMU, due to the faster 
> execution in emulating on a modern processor.
> 

Right -- given that the failure is intermittent on real hardware, it's not 
surprising that I can't make it show up in QEMU or Aranym.

But no-one has reproduced it on Atari or Amiga hardware yet, so I guess it 
could be a driver issue...

I wonder whether anyone else is actually running recent Debian/SID with 
sysvinit and without a Debian initrd on a Motorola 68030 system.

> Thinking a bit more about interactions between signal delivery and 
> syscall return, it turns out that we don't check for pending signals 
> when returning from a syscall. That's OK on non-SMP systems, because we 
> don't have another process running while we execute the syscall (and we 
> _do_ run signal handling when scheduling, i.e. when wait4 sleeps or is 
> woken up)?
> 
> Seems we can forget about that interaction then.
> 
> >
> >> depends on how long we sleep in wait4, and whether a signal happens 
> >> just during that time.
> >>
> >
> > I agree, there seems to be a race condition there. (And dash's 
> > waitproc() seems to take pains to reap the child and handle the signal 
> > in any order.)
> 
> Yes, it makes sure the SIGCHLD is seen no matter in what order the 
> signals are delivered ...
> 
> > I wouldn't be surprised if this race somehow makes the failure rare.
> >
> > I don't want to recompile any userland binaries at this stage, so it 
> > would be nice if we could modify the kernel to keep track of exactly 
> > how that race gets won and lost. Or perhaps there's an easy way to rig 
> > the outcome one way or the other.
> 
> A race between syscall return due to child exit and signal delivery 
> seems unlikely, but maybe there is a race between syscall return due to 
> a timer firing and signal delivery. Are there any timers set to 
> periodically interrupt wait3?
> 

I searched the source code and SIGALRM appears to be unused by dash. And 
'timeout' is not a dash builtin. But that doesn't mean we don't get 
multiple signals. One crashy script looks like this:

TMPFS_SIZE="$(tmpfs_size_vm "$TMPFS_SIZE")"
RUN_SIZE="$(tmpfs_size_vm "$RUN_SIZE")"
LOCK_SIZE="$(tmpfs_size_vm "$LOCK_SIZE")"
SHM_SIZE="$(tmpfs_size_vm "$SHM_SIZE")"
TMP_SIZE="$(tmpfs_size_vm "$TMP_SIZE")"

Is it possible that the SIGCHLD from the first sub-shell got delayed?
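For what it's worth, the pattern those command substitutions exercise can 
be boiled down to a few lines of C. This is only a sketch of my own (not 
dash's code, and it hasn't been shown to trip anything here) -- it forks 
one child per iteration and reaps it with wait3() while a minimal SIGCHLD 
handler races with the syscall return:

#define _DEFAULT_SOURCE		/* for wait3() on glibc */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void onsig(int signo)
{
	(void)signo;
	got_sigchld = 1;	/* just set a flag, nothing clever */
}

int main(void)
{
	struct sigaction sa = { .sa_handler = onsig };

	sigemptyset(&sa.sa_mask);
	sigaction(SIGCHLD, &sa, NULL);

	for (int i = 0; i < 100000; i++) {
		got_sigchld = 0;

		pid_t pid = fork();
		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			_exit(0);	/* child: exit straight away */

		/* Parent: sleep in wait3() until the child is reaped.
		 * Without SA_RESTART the SIGCHLD may interrupt the
		 * syscall rather than arrive after it returns. */
		int status;
		pid_t reaped;
		do {
			reaped = wait3(&status, 0, NULL);
		} while (reaped < 0 && errno == EINTR);

		if (reaped != pid)
			fprintf(stderr, "iteration %d: wait3 returned %ld\n",
				i, (long)reaped);
		/* Note which way the handler-vs-return race went. */
		if (!got_sigchld)
			fprintf(stderr,
				"iteration %d: wait3 returned before the handler ran\n",
				i);
	}
	return 0;
}

Something like that could be left running on the 030 box to see whether 
plain fork/wait3/SIGCHLD traffic ever misbehaves on its own, without 
recompiling dash.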

> 
> Still no nearer to a solution - something smashes the stack near %sp, 
> causes the %a3 register restore after __GI___wait4_time64 to return a 
> wrong pointer to the stack canary, and triggers a stack smashing warning 
> in this indirect way. But what??
> 

I've no idea.

The actual corruption might offer a clue here. I believe the saved %a3 was 
clobbered with the value 0xefee1068, which seems to be a pointer into some 
stack frame that would have come into existence shortly after 
__GI___wait4_time64 was called. That stack frame was gone by the time the 
core dump was made. Was it dash's signal handler, onsig(), some libc 
subroutine called by __GI___wait4_time64(), or something that the kernel 
put there?
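To make that failure mode concrete, here is a rough model (my own 
simplification -- not the real m68k epilogue, and the names are made up) 
of what a -fstack-protector epilogue effectively does: reload the guard 
through a pointer kept in a callee-saved register and compare it with the 
canary copy saved in the frame. If that register comes back from the frame 
clobbered, the reload fetches some unrelated word and the check fires even 
though the canary itself was never touched:

/* Simplified model of a stack-protector check; names are made up. */
#include <stdio.h>
#include <stdlib.h>

unsigned long fake_stack_chk_guard = 0x00a5c0deUL; /* stands in for __stack_chk_guard */
unsigned long unrelated_word = 0xefee1068UL;       /* what a clobbered pointer might hit */

static void epilogue_check(const unsigned long *guard_ptr, unsigned long saved_canary)
{
	/* The compiled epilogue does roughly this: reload the guard
	 * through a callee-saved pointer register and compare it with
	 * the canary copy the prologue stored in the stack frame. */
	if (*guard_ptr != saved_canary) {
		fprintf(stderr, "*** stack smashing detected ***\n");
		abort();
	}
}

int main(void)
{
	unsigned long canary = fake_stack_chk_guard; /* prologue: copy guard into the frame */

	/* Normal case: the pointer register still points at the guard. */
	epilogue_check(&fake_stack_chk_guard, canary);

	/* What the core dump suggests: the restored pointer register is
	 * bogus, so the reload reads some other word and the comparison
	 * fails even though the canary in the frame is intact. */
	epilogue_check(&unrelated_word, canary);

	return 0;
}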

Dash's SIGCHLD handler looks safe enough -- I don't see how it could 
corrupt the saved registers in the __GI___wait4_time64 stack frame (it's 
not like 1 was stored in the wrong place). 
https://sources.debian.org/src/dash/0.5.12-2/src/trap.c/?hl=285#L285
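For reference, the shape of that handler is roughly the following (a 
simplified paraphrase, not the verbatim trap.c source): the only stores go 
to flags in static storage, so there is nothing in it that could write 
into a caller's stack frame.

/* Simplified paraphrase (not the verbatim trap.c source) of what the
 * handler does: record the signal in static flags and return. */
#include <signal.h>

static volatile sig_atomic_t gotsig[64];	/* dash sizes this by NSIG */
static volatile sig_atomic_t pending_sig;

static void onsig(int signo)
{
	gotsig[signo - 1] = 1;	/* mark the signal as seen */
	pending_sig = signo;	/* the main loop acts on it later */
}

int main(void)
{
	signal(SIGCHLD, onsig);
	raise(SIGCHLD);		/* deliver one SIGCHLD to ourselves */
	return gotsig[SIGCHLD - 1] ? 0 : 1;
}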

