Re: New test kernel - second attempt
John Paul Adrian Glaubitz wrote:
Keep in mind you may have to keep the machine off for a longer time or reset
the NV-RAM. We've got multiple reports now of machines that became stable
after that.
With the added experience of console wizardry, I attempted a "cold first
boot" of the system into this kernel:
Debian GNU/Linux, with Linux 6.12+unreleased-sparc64-smp
I started from a full power-off (plugs pulled from the socket), powered on,
and did the first fresh boot directly into that kernel.
It still crashes this way:
[ 10.828819] mptsas 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 11.166450] NON-RESUMABLE ERROR: Reporting on cpu 4------------------------+
[ 11.166573] NON-RESUMABLE ERROR: TPC [0x000000001017e034] <MakeIocReady+0x10/0x298 [mptbase]>
[ 11.166810] NON-RESUMABLE ERROR: RAW [0410000000000001:0000000cccd2fe4c:0000000202000004:000000ea00300000
[ 11.166895] NON-RESUMABLE ERROR: 0000000000040000:0000000000000000:0000000000000000:0000000000000000]
[ 11.166978] NON-RESUMABLE ERROR: handle [0x0410000000000001] stick [0x0000000cccd2fe4c]
[ 11.167051] NON-RESUMABLE ERROR: type [precise nonresumable]
[ 11.167114] NON-RESUMABLE ERROR: attrs [0x02000004] < PIO sp-faulted priv >
[ 11.167238] NON-RESUMABLE ERROR: raddr [0x000000ea00300000]
[ 11.168363] Kernel panic - not syncing: Non-resumable error.
[ 11.168443] CPU: 4 UID: 0 PID: 406 Comm: (udev-worker) Not tainted 6.12+unreleased-sparc64-smp #1 Debian 6.12.43-1+nothp1
[ 11.168569] Call Trace:
[ 11.168622] [<0000000000eff2b4>] dump_stack+0x8/0x18
[ 11.168712] [<0000000000ef1930>] panic+0xf4/0x398
[ 11.168791] [<000000000042a48c>] sun4v_nonresum_error+0x16c/0x240
[ 11.168887] [<0000000000406eb8>] sun4v_nonres_mondo+0xc8/0xd8
[ 11.168990] [<000000001017e034>] MakeIocReady+0x10/0x298 [mptbase]
[ 11.169096] [<000000001017e4b4>] mpt_do_ioc_recovery+0x9c/0x1110 [mptbase]
[ 11.169202] [<000000001017d6f8>] mpt_attach+0xb58/0xd20 [mptbase]
[ 11.169305] [<0000000010283f30>] mptsas_probe+0x10/0x440 [mptsas]
[ 11.169431] [<0000000000ad1fac>] pci_device_probe+0xac/0x180
[ 11.169532] [<0000000000b8b8e8>] really_probe+0xc8/0x400
[ 11.169625] [<0000000000b8bcac>] __driver_probe_device+0x8c/0x160
[ 11.169720] [<0000000000b8be68>] driver_probe_device+0x28/0x100
[ 11.169814] [<0000000000b8c11c>] __driver_attach+0xbc/0x1e0
[ 11.169908] [<0000000000b8927c>] bus_for_each_dev+0x5c/0xc0
[ 11.169998] [<0000000000b8b09c>] driver_attach+0x1c/0x40
[ 11.170089] [<0000000000b8a860>] bus_add_driver+0x180/0x240
[ 11.791693] Press Stop-A (L1-A) from sun keyboard or send break
[ 11.791693] twice on console to return to the boot prom
[ 11.792002] ---[ end Kernel panic - not syncing: Non-resumable error. ]---
At a quick glance the error looks the same as in my previous report, but on
CPU#4 rather than CPU#24.
Now that I know how to switch to ALOM, I did a few quick "hot reboots" to
see if something changed.
Run 2: CPU 17
Run 3: CPU 14
Then I did a poweroff/poweron (without pulling the plugs this time):
Run 1: CPU 9
Run 2: CPU 0
I would say it looks "random". The last one, CPU 0, is interesting: until
then it looked like it was always a higher number, but well, I guess with
32 cores...
And I stop here, I am not going to do 32 reboots to find out!
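For anyone who wants to repeat this over more boots, here is a minimal sketch
of how the reporting CPU could be pulled out of saved console logs and
tallied. It assumes each boot's console output has been captured as a string;
the function name tally_reporting_cpus is hypothetical, and the pattern only
targets the "Reporting on cpu" line format seen in the log above.

```python
import re
from collections import Counter

# Matches e.g. "NON-RESUMABLE ERROR: Reporting on cpu 4".
# \s+ also tolerates a line wrap between "cpu" and the number.
CPU_RE = re.compile(r"NON-RESUMABLE ERROR: Reporting on cpu\s+(\d+)")

def tally_reporting_cpus(logs):
    """Count how often each CPU number reported the non-resumable error.

    `logs` is an iterable of strings, one captured console log per boot
    (an assumption of this sketch). Boots without the error are skipped.
    """
    counts = Counter()
    for log in logs:
        match = CPU_RE.search(log)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

With only five runs (CPUs 4, 17, 14, 9, 0) the tally cannot say much about
the distribution, but it would make a longer series easier to eyeball.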
Riccardo