
Re: New test kernel - second attempt



John Paul Adrian Glaubitz wrote:
Keep in mind you may have to keep the machine off for a longer time or reset
the NV-RAM. We've got multiple reports now of machines that became stable
after that.

With the added experience of console wizardry, I attempted a "cold first boot" of the system into this kernel:

Debian GNU/Linux, with Linux 6.12+unreleased-sparc64-smp

I did this from a full power-off (pulled the socket plugs): direct power-on, so a complete power cycle and a first fresh boot into that kernel.

It still crashes this way:
[   10.828819] mptsas 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   11.166450] NON-RESUMABLE ERROR: Reporting on cpu 4------------------------+
[   11.166573] NON-RESUMABLE ERROR: TPC [0x000000001017e034] <MakeIocReady+0x10/0x298 [mptbase]>
[   11.166810] NON-RESUMABLE ERROR: RAW [0410000000000001:0000000cccd2fe4c:0000000202000004:000000ea00300000
[   11.166895] NON-RESUMABLE ERROR: 0000000000040000:0000000000000000:0000000000000000:0000000000000000]
[   11.166978] NON-RESUMABLE ERROR: handle [0x0410000000000001] stick [0x0000000cccd2fe4c]
[   11.167051] NON-RESUMABLE ERROR: type [precise nonresumable]
[   11.167114] NON-RESUMABLE ERROR: attrs [0x02000004] < PIO sp-faulted priv >
[   11.167238] NON-RESUMABLE ERROR: raddr [0x000000ea00300000]
[   11.168363] Kernel panic - not syncing: Non-resumable error.
[   11.168443] CPU: 4 UID: 0 PID: 406 Comm: (udev-worker) Not tainted 6.12+unreleased-sparc64-smp #1  Debian 6.12.43-1+nothp1
[   11.168569] Call Trace:
[   11.168622] [<0000000000eff2b4>] dump_stack+0x8/0x18
[   11.168712] [<0000000000ef1930>] panic+0xf4/0x398
[   11.168791] [<000000000042a48c>] sun4v_nonresum_error+0x16c/0x240
[   11.168887] [<0000000000406eb8>] sun4v_nonres_mondo+0xc8/0xd8
[   11.168990] [<000000001017e034>] MakeIocReady+0x10/0x298 [mptbase]
[   11.169096] [<000000001017e4b4>] mpt_do_ioc_recovery+0x9c/0x1110 [mptbase]
[   11.169202] [<000000001017d6f8>] mpt_attach+0xb58/0xd20 [mptbase]
[   11.169305] [<0000000010283f30>] mptsas_probe+0x10/0x440 [mptsas]
[   11.169431] [<0000000000ad1fac>] pci_device_probe+0xac/0x180
[   11.169532] [<0000000000b8b8e8>] really_probe+0xc8/0x400
[   11.169625] [<0000000000b8bcac>] __driver_probe_device+0x8c/0x160
[   11.169720] [<0000000000b8be68>] driver_probe_device+0x28/0x100
[   11.169814] [<0000000000b8c11c>] __driver_attach+0xbc/0x1e0
[   11.169908] [<0000000000b8927c>] bus_for_each_dev+0x5c/0xc0
[   11.169998] [<0000000000b8b09c>] driver_attach+0x1c/0x40
[   11.170089] [<0000000000b8a860>] bus_add_driver+0x180/0x240
[   11.791693] Press Stop-A (L1-A) from sun keyboard or send break
[   11.791693] twice on console to return to the boot prom
[   11.792002] ---[ end Kernel panic - not syncing: Non-resumable error. ]---

At a quick glance the error seems the same as in my previous report, but on CPU#4 and not CPU#24.


Now that I know how to switch to ALOM, I tried a few quick "hot reboots" to see if anything changed.

Run 2: CPU 17
Run 3: CPU 14


Then I did a power-off/power-on (without pulling the plugs, though):
Run 1: CPU 9
Run 2: CPU 0


I would say it looks "random". The last one, CPU 0, is interesting: until then it looked like it was always a higher number, but well, I guess with 32 cores...

And I stop here, I'm not going to do 32 reboots to find out!
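
If someone does want to tally this over more runs, here is a minimal sketch that pulls the reporting CPU out of saved console output. It assumes each boot's console log was captured as text; the function names are hypothetical, and the only thing taken from the log above is the "Reporting on cpu N" line format.

```python
import re
from collections import Counter

# Matches the line the kernel prints when it reports the error, e.g.
# "[   11.166450] NON-RESUMABLE ERROR: Reporting on cpu 4"
CPU_RE = re.compile(r"NON-RESUMABLE ERROR: Reporting on cpu (\d+)")


def reporting_cpus(log_text):
    """Return the CPU numbers found in captured console output, in order."""
    return [int(n) for n in CPU_RE.findall(log_text)]


def tally(logs):
    """Count how often each CPU reported the error across several logs."""
    counts = Counter()
    for text in logs:
        counts.update(reporting_cpus(text))
    return counts


if __name__ == "__main__":
    # Sample line copied from the crash log in this mail.
    sample = "[   11.166450] NON-RESUMABLE ERROR: Reporting on cpu 4\n"
    print(tally([sample]))
```

With only a handful of runs this says nothing about the distribution, it just saves reading the CPU number off the console by hand.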

Riccardo

