Re: core dump analysis, was Re: stack smashing detected
On Mon, 3 Apr 2023, Michael Schmitz wrote:
> On 02.04.2023 at 21:31, Finn Thain wrote:
> >
> >>
> >> Maybe an interaction between (multiple?) signals and syscall
> >> return...
> >
> > When running dash from gdb in QEMU, there's only one signal (SIGCHLD)
> > and it gets handled before __wait3() returns. (Of course, the "stack
> > smashing detected" failure never shows up in QEMU.)
>
> Might be a clue that we need multiple signals to force the stack
> smashing error. And we might not get that in QEMU, due to faster
> execution when emulating on a modern processor.
>
Right -- given that the failure is intermittent on real hardware, it's not
surprising that I can't make it show up in QEMU or Aranym.
But no-one has reproduced it on Atari or Amiga hardware yet, so I guess it
could be a driver issue...
I wonder whether anyone else is actually running recent Debian/SID with
sysvinit and without a Debian initrd on a Motorola 68030 system.
> Thinking a bit more about interactions between signal delivery and
> syscall return, it turns out that we don't check for pending signals
> when returning from a syscall. That's OK on non-SMP systems, because
> we don't have another process running while we execute the syscall
> (and we _do_ run signal handling when scheduling, i.e. when wait4
> sleeps or is woken up)?
>
> Seems we can forget about that interaction then.
>
> >
> >> depends on how long we sleep in wait4, and whether a signal happens
> >> just during that time.
> >>
> >
> > I agree, there seems to be a race condition there. (And dash's
> > waitproc() seems to take pains to reap the child and handle the signal
> > in any order.)
>
> Yes, it makes sure the SIGCHLD is seen no matter in what order the
> signals are delivered ...
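For what it's worth, the usual way to get that order-independence is to
block SIGCHLD, test a flag set by the handler, and sleep in sigsuspend(),
which unblocks and waits atomically. A rough, untested sketch of that
general pattern -- not dash's actual waitproc(), the names are mine:

#include <signal.h>
#include <stddef.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;    /* the handler only sets a flag */
}

/* Reap one child; assumes on_sigchld() was installed with sigaction(). */
static pid_t wait_for_child(int *status)
{
    sigset_t block, old;
    pid_t pid;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);

    for (;;) {
        got_sigchld = 0;
        /* With SIGCHLD blocked, this can't race with the handler. */
        pid = wait3(status, WNOHANG, NULL);
        if (pid != 0)
            break;    /* reaped a child, or got an error */
        /* Atomically unblock SIGCHLD and sleep until it arrives. */
        while (!got_sigchld)
            sigsuspend(&old);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}

Either the child has already exited (wait3() reaps it straight away) or it
hasn't yet (sigsuspend() wakes us up), so neither ordering gets lost.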
>
> > I wouldn't be surprised if this race somehow makes the failure rare.
> >
> > I don't want to recompile any userland binaries at this stage, so it
> > would be nice if we could modify the kernel to keep track of exactly
> > how that race gets won and lost. Or perhaps there's an easy way to rig
> > the outcome one way or the other.
>
> A race between syscall return due to child exit and signal delivery
> seems unlikely, but maybe there is a race between syscall return due to
> a timer firing and signal delivery. Are there any timers set to
> periodically interrupt wait3?
>
I searched the source code and SIGALRM appears to be unused by dash. And
'timeout' is not a dash builtin. But that doesn't mean we don't get
multiple signals. One crashy script looks like this:
TMPFS_SIZE="$(tmpfs_size_vm "$TMPFS_SIZE")"
RUN_SIZE="$(tmpfs_size_vm "$RUN_SIZE")"
LOCK_SIZE="$(tmpfs_size_vm "$LOCK_SIZE")"
SHM_SIZE="$(tmpfs_size_vm "$SHM_SIZE")"
TMP_SIZE="$(tmpfs_size_vm "$TMP_SIZE")"
Is it possible that the SIGCHLD from the first sub-shell got delayed?
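If it would help to see how the race actually plays out on the machine that
fails, a standalone probe that mimics that fork/wait3 pattern might do:
fork a child that exits immediately, wait3() for it, and record whether the
SIGCHLD handler had already run when wait3() returned. Just a rough,
untested sketch (the names are mine, it's not dash code):

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;    /* just set a flag, like dash does */
}

int main(void)
{
    struct sigaction sa = { .sa_handler = on_sigchld };
    int i;

    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);

    for (i = 0; i < 5; i++) {    /* five cycles, like the five $(...) above */
        pid_t pid, reaped;
        int status;

        got_sigchld = 0;
        pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0)
            _exit(0);    /* child: exit immediately */

        reaped = wait3(&status, 0, NULL);
        printf("cycle %d: reaped %d, handler %s run when wait3 returned\n",
               i, (int)reaped, got_sigchld ? "had" : "had not");
    }
    return 0;
}

Running that in a tight loop on the 68030 box (and in QEMU for comparison)
might at least show whether the handler-after-return ordering ever happens.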
>
> Still no nearer to a solution - something smashes the stack near %sp,
> causes the %a3 register restore after __GI___wait4_time64 to return a
> wrong pointer to the stack canary, and triggers a stack smashing warning
> in this indirect way. But what??
>
I've no idea.
The actual corruption might offer a clue here. I believe the saved %a3 was
clobbered with the value 0xefee1068, which seems to be a pointer into some
stack frame that would have come into existence shortly after
__GI___wait4_time64 was called. That stack frame was gone by the time the
core dump was made. Was it dash's signal handler, onsig(), or some libc
subroutine called by __GI___wait4_time64(), or was it something that the
kernel put there?
Dash's SIGCHLD handler looks safe enough -- I don't see how it could
corrupt the saved registers in the __GI___wait4_time64 stack frame (it's
not like 1 was stored in the wrong place).
https://sources.debian.org/src/dash/0.5.12-2/src/trap.c/?hl=285#L285
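From memory, the shape of that handler is roughly the following (a
paraphrase, not the real code -- the trap.c link above has the exact
version); the point being that it only stores small constants into globals
at fixed addresses and never writes through a stack address:

#include <signal.h>

volatile sig_atomic_t gotsigchld;
volatile sig_atomic_t pending_sig;
char gotsig[64];    /* indexed by signal number - 1 */

void onsig(int signo)
{
    if (signo == SIGCHLD)
        gotsigchld = 1;

    gotsig[signo - 1] = 1;
    pending_sig = signo;
}

So even if it ran at the worst possible moment, it shouldn't be able to
scribble on a saved register slot in __GI___wait4_time64's frame.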