Hi Finn,

reproduced on my Falcon (with minor mods to the C source: my version of gcc didn't like asm with no clobbers, so I added "memory" as a clobber in the second asm block). In this case it's a4 that is corrupted, but that varies.

A depth of 4096 gets me two core dumps in 20 attempts, so this isn't quite as fast on my Falcon. With 8192, it's nine.
Example:

Core was generated by `./moveml'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0  0x8000060e in rec ()
(gdb) info reg
d0             0x8000057c       -2147482244
d1             0xc0017000       -1073647616
d2             0xd1d2d3d4       -774712364
d3             0xe1e2e3e4       -505224220
d4             0xf1f2f3f4       -235736076
d5             0x80096168       -2146868888
d6             0x80093108       -2146881272
d7             0x0              0
a0             0x0              0x0
a1             0xefdadbdc       0xefdadbdc
a2             0x91929394       0x91929394
a3             0xa1a2a3a4       0xa1a2a3a4
a4             0x8000057c       0x8000057c
a5             0xc1c2c3c4       0xc1c2c3c4
fp             0xef87402c       0xef87402c
sp             0xef874010       0xef874010
ps             0x209            521
pc             0x8000060e       0x8000060e <rec+242>
fpcontrol      0x0              0
fpstatus       0x0              0
fpiaddr        0x0              0
(gdb)

On 20.04.2023 at 14:57, Finn Thain wrote:
> On Thu, 20 Apr 2023, Michael Schmitz wrote:
>
>> Can you try and fault in as many of these stack pages as possible,
>> ahead of filling the stack? (Depending on how much RAM you have ...).
>> Maybe we would need to lock those pages into memory? Just to show that
>> with no page faults (but still signals) there is no corruption?
>
> OK.
>
>>> Any signal frames or exception frames have been completely
>>> overwritten because the recursion continued after the corruption took
>>> place. So there's not much to see in the core dump.
>>
>> We'd need a way to stop recursion once the first corruption has taken
>> place. If the 'safe' recursion depth of 10131 is constant, the dump
>> taken at that point should look similar to what you saw in dash
>> (assuming it is the page fault and subsequent signal return that causes
>> the corruption).
>
> It turns out that the recursion depth can be set a lot lower than the
> 200000 that I chose in that test program. (I used that value as it kept
> the stack size just below the default 8192 kB limit.)
And it does keep the core a lot smaller. Still not hard to work with on my 14 MB RAM Falcon...
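
On stopping the recursion at the first corruption: below is a rough, untested-on-m68k sketch in portable C of what I have in mind. It is not the moveml test program. The names `rec_checked`, `CANARY` and `PAD` are made up for illustration, and a canary in each stack frame stands in for the register fill patterns that the real test checks. Each frame touches its own stack first (which is where a page fault could hit), then verifies its own and its caller's canary before descending any further.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CANARY 0xa1a2a3a4u  /* arbitrary fill pattern (assumption) */
#define PAD    256          /* per-frame padding in bytes (assumption) */

/*
 * Recurse `depth` frames deep. Every frame checks its own canary and the
 * caller's canary BEFORE recursing, so the descent stops at the first
 * mismatch instead of overwriting more of the stack.
 *
 * Returns 0 if no mismatch was seen, otherwise the value of `depth` in
 * the frame that noticed the mismatch.
 */
int rec_checked(int depth, const volatile uint32_t *parent)
{
    volatile uint32_t canary = CANARY;
    volatile char pad[PAD];

    assert(depth >= 1);

    /* Touch both ends of the frame; this may fault in a new stack page. */
    pad[0] = 1;
    pad[PAD - 1] = 1;

    if (canary != CANARY || (parent != NULL && *parent != CANARY))
        return depth;               /* stop here, do not go deeper */

    if (depth > 1)
        return rec_checked(depth - 1, &canary);

    return 0;
}
```

The caller would do something like `if (rec_checked(2048, NULL) != 0) abort();` so that the core is written while whatever frames sit just below the failing one are still intact.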
> At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
> still get an intermittent failure. This only required 21 stack
> pagefaults and one fork.
>
> I suspect that the location of the corruption is probably somewhat
> random, and the larger the stack happens to be when the signal comes
> in, the better the odds of detection.
Yep, but there must be some more to that. Timing of page faults due to swap bandwidth, perhaps?
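
To take page faults out of the picture, the stack could be faulted in (and optionally locked) before the recursion starts. A minimal sketch, assuming a Linux userland with `sysconf()` and `mlockall()`; the helper names `prefault_stack` and `prefault_and_lock` are made up, and the size to pass in would have to be matched to the recursion depth and to available RAM:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Touch one byte per page of a `bytes`-sized buffer on the stack, working
 * from the highest address down (the direction in which the stack grows).
 * The pages stay mapped after this function returns.
 * Returns the number of writes performed (one per page step).
 */
size_t prefault_stack(size_t bytes)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t page, off, touched = 0;

    assert(pagesize > 0);
    if (bytes == 0)
        return 0;
    page = (size_t)pagesize;

    volatile char buf[bytes];       /* C99 variable-length array */

    off = bytes;
    while (off > 0) {
        buf[off - 1] = 0;
        touched++;
        off = (off > page) ? off - page : 0;
    }
    return touched;
}

/*
 * Fault the stack in first, then lock everything currently mapped.
 * mlockall() can fail (e.g. RLIMIT_MEMLOCK, or not enough RAM), in which
 * case this reports the error and returns -1.
 */
int prefault_and_lock(size_t bytes)
{
    prefault_stack(bytes);
    if (mlockall(MCL_CURRENT) != 0) {
        perror("mlockall");
        return -1;
    }
    return 0;
}
```

Calling `prefault_and_lock()` once at the start of main(), before the fork and the recursion, should mean the recursion itself takes no stack page faults. If the corruption still shows up with signals alone, page faults are off the hook.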
Cheers,

Michael