
Kernel Panics / Re: GRUB testers on SPARC needed



Hi Adrian,
for the sake of visibility, here is the response regarding the kernel trouble:

On 17.05.2021 at 10:23, John Paul Adrian Glaubitz wrote:
Installing on two SunFire v215 went reasonably well

(apart from recurring kernel panics with "Unable to handle kernel paging request in mna handler",
most often triggered on boot immediately after the systemd binfmt service tries to start. This seems
to have been mentioned in /2020/04/msg00020.html but never pinpointed and fixed?)
What kernel version are you running? There have actually been some fixes in this regard, in particular
this fix:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/sparc?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f
I'm using the latest version from the repositories:
5.10.0-6-sparc64-smp #1 SMP Debian 5.10.28-1 (2021-04-09) sparc64 GNU/Linux
The commit you mention seems to be in 5.12 and 5.13-rc2?
Is there a pre-built SMP-image for this or do I have to set up building myself?
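
For reference, this is roughly how I would check whether that commit is in a given tree and build a patched package myself (untested on these boxes so far; the version and config names are just what I'd expect here):

  # needs a full clone of torvalds/linux.git; lists the tags containing the fix
  git tag --contains e5e8b80d352ec999d2bba3ea584f541c83f4ca3f

  # one way to build a Debian-style package from an upstream tag that has the fix
  sudo apt-get install build-essential bc bison flex libssl-dev libelf-dev
  git clone --branch v5.12 --depth 1 \
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  cp /boot/config-5.10.0-6-sparc64-smp .config   # start from the running config
  make olddefconfig
  make -j"$(nproc)" bindeb-pkg                   # produces ../linux-image-*.deb
  sudo dpkg -i ../linux-image-5.12.0*.deb

That said, a pre-built image would obviously be much more convenient.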


Running on the v215 was a real nightmare yesterday. They will be stable for hours, but certain actions (in one occurrence executing dmesg (!), apt-get installing nfs-common, some mounts systemd tries) crashed both of the boxes with various errors, most of them ending in the dreaded "Unable to handle kernel paging request" line. I retried the dpkg-reconfigure as well: after the systemd unit for - I think - rpcbind was activated during package configuration, both boxes crashed about 4-6 times, and I had to reset from OBP.
After a few tries, the installation finally went through. Now I can mount NFS...
The hours I spent in the rescue mode of the current installation CD without any trouble made me suspect that non-SMP kernels are "more stable". I'm currently running the SMP variant with "maxcpus=1", which seems stable so far... But as with any other sporadic issue, that is hard to tell without a way to reliably trigger the errors...
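
In case it is useful to anyone reproducing this, I am pinning that via the kernel command line rather than at the OBP prompt; roughly like this, assuming the box boots via the GRUB port this thread is about (with SILO it would be the append line in silo.conf instead):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="maxcpus=1"

  # regenerate grub.cfg and reboot
  sudo update-grub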

The worst offender so far seems to be xfs, though...
I initially installed both v215 with ext3 /boot and xfs for /.
I'm not sure if the problem is related, but xfs seems to frequently encounter
[   35.325122] XFS (md1): Metadata corruption detected at xfs_dinode_verify.part.0+0x358/0x6c0 [xfs], inode 0x402c4d0 dinode
[   35.469639] XFS (md1): Unmount and run xfs_repair
on both machines. xfs_repair doesn't do anything, though. Either these inodes were the last ones written during the kernel panics, or the underlying issue behind the panics leads to checksum mismatches in memory? The latter seems more likely, because during dpkg installs the following popped up a few times as well:
[  195.360257] XFS (md1): Corruption of in-memory data detected.  Shutting down filesystem
(after that, obviously, the system is unusable despite not panicking, as root is missing entirely...)
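
For completeness, what I do from the rescue shell to check the filesystem is roughly the following (md1 is the md device holding / on these machines; adjust device names as needed):

  # assemble the array and make sure nothing has it mounted
  mdadm --assemble --scan
  umount /dev/md1 2>/dev/null

  # dry run first: report problems without modifying anything
  xfs_repair -n /dev/md1

  # keep a metadata dump around before letting it change the disk
  xfs_metadump /dev/md1 /tmp/md1-metadump
  xfs_repair /dev/md1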

Some faults:
[  281.304119] WARNING: CPU: 1 PID: 11 at kernel/smp.c:633 smp_call_function_many_cond+0x3bc/0x3e0
[  281.418696] Modules linked in: ext4(E) crc16(E) mbcache(E) jbd2(E) sr_mod(E) cdrom(E) ata_generic(E) tg3(E) libphy(E) ptp(E) ohci_pci(E) sg(E) pata_ali(E) ehci_pci(E) ohci_hcd(E) ehci_hcd(E) libata(E) pps_core(E) usbcore(E) usb_common(E) flash(E) drm(E) drm_panel_orientation_quirks(E) i2c_core(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) xfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) raid6_pq(E) async_xor(E) xor(E) async_tx(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) raid1(E) md_mod(E) sd_mod(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_common(E) mptsas(E) scsi_transport_sas(E) mptscsih(E) mptbase(E) scsi_mod(E)
[  282.224447] CPU: 1 PID: 11 Comm: ksoftirqd/1 Tainted: G D     E     5.10.0-6-sparc64-smp #1 Debian 5.10.28-1
[  282.359710] Call Trace:
[  282.391788] [<000000000046c67c>] __warn+0xbc/0x120
[  282.454810] [<0000000000c450f8>] warn_slowpath_fmt+0x34/0x74
[  282.529285] [<0000000000517b5c>] smp_call_function_many_cond+0x3bc/0x3e0
[  282.617512] [<0000000000517be4>] smp_call_function+0x24/0x40
[  282.691989] [<0000000000441828>] smp_send_stop+0x28/0x120
[  282.763028] [<0000000000c44e84>] panic+0x110/0x350
[  282.826047] [<0000000000472ad0>] do_exit+0xad0/0xb20
[  282.891357] [<0000000000c43ab0>] die_if_kernel+0x1f4/0x260
[  282.963543] [<0000000000c5501c>] unhandled_fault+0x88/0xac
[  283.035728] [<0000000000c553c8>] do_sparc64_fault+0x388/0xa80
[  283.111354] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[  283.193850] [<00000000005a1e64>] __bpf_prog_put_rcu+0x24/0x60
[  283.269470] [<00000000004f5c20>] rcu_core+0x240/0x620
[  283.335926] [<00000000004f600c>] rcu_core_si+0xc/0x20
[  283.402383] [<0000000000c5602c>] __do_softirq+0x10c/0x3a0
[  283.473423] [<0000000000473b14>] run_ksoftirqd+0x34/0x60
[  283.543315] ---[ end trace 9f0a29fcdf85be47 ]---

[  124.914048] CPU[1]: Cheetah+ D-cache parity error at TPC[00000000005bc2b0]
[  125.004638] TPC<bpf_check+0x1cd0/0x32e0>
nfs-utils.service is a disabled or a static unit, not starting it.
[  125.528183] Kernel unaligned access at TPC[8ffba4] atomic64_sub_return+0x4/0x54
[  125.624591] Unable to handle kernel paging request in mna handler
[  125.624595]  at virtual address 6f430c861b2ffaab
[  125.765686] current->{active_,}mm->context = 00000000000000c1
[  125.841410] current->{active_,}mm->pgd = fff0000001b94000
[  125.912544]               \|/ ____ \|/
[  125.912544]               "@'/ .. \`@"
[  125.912544]               /_| \__/ |_\
[  125.912544]                  \__U_/
[  126.106299] systemd(1): Oops [#1]

The "Unable to handle kernel paging request in mna handler" part is especially interesting. It's nearly identical to the issue posted about a year ago, which seems to have been introduced somewhere around kernel 5.0. It is not always accompanied by the "Cheetah+ D-cache parity error at TPC[00000000005bc2b0]" error. And while CPU L1-cache parity errors "seem" like a hardware issue, I highly suspect they are not: both machines show these errors sporadically (often without panicking!), and I have found mentions of these errors in other contexts... Also, both systems were highly stable in the past, exceeding 2-3 years of uptime, although on Solaris 10 :-(
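
If it helps with pinpointing this, I can try to resolve the TPC addresses against a kernel with debug info; roughly like this, assuming a matching unstripped vmlinux can be obtained (from a debug package or a local rebuild of 5.10.28-1):

  # resolve a trap PC to file:line (modulo any load-address offset)
  addr2line -f -i -e vmlinux 0x00000000005bc2b0

  # or poke around the faulting instruction interactively
  gdb vmlinux --batch \
      -ex 'info line *0x00000000005bc2b0' \
      -ex 'disassemble bpf_check'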


On a not entirely unrelated note:
Is there any news on functioning netboot images? The last post I could find points to images from April '17 on your webspace, which were, according to the mailing list, not bootable because of their size.
At least I can't boot them either.
If there is no more recent version, I'll try to build something myself - are there any pointers on how to go about this? A minimal OS or the netinstaller in an .img would be preferred.
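
If I do build something, my plan would be the usual OBP rarp/tftp route, roughly as below; as far as I remember, OpenBoot requests a file named after the client's IP address as eight uppercase hex digits (MAC and addresses below are placeholders):

  # on the boot server
  sudo apt-get install tftpd-hpa
  # plus rarpd or a DHCP/BOOTP server to hand out the client's address;
  # for rarpd, map the client's MAC to a hostname in /etc/ethers and /etc/hosts

  # 192.168.1.2 -> C0A80102
  sudo cp netboot.img /srv/tftp/C0A80102

  # then, at the client's OBP prompt:
  #   ok boot net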

I think that would help with quick testing, as I have multiple other systems with Cheetah CPUs (UltraSPARC III, III Cu and IIIi) that I'd like to try provoking the panics on. Also, some older systems (UltraSPARC IIi and IIe+) are waiting for a recent Debian :-)


Thanks,

- Robin

