Bug#990642: linux-image-4.19.0-17-amd64: kernel panic on xen dom0 with Broadcom Limited NetXtreme II BCM5709

To: spi@gmxpro.de, 990642@bugs.debian.org, Salvatore Bonaccorso <carnil@debian.org>
Subject: Bug#990642: linux-image-4.19.0-17-amd64: kernel panic on xen dom0 with Broadcom Limited NetXtreme II BCM5709
From: Hans van Kranenburg <hans@knorrie.org>
Date: Thu, 30 Sep 2021 22:03:03 +0200
Message-id: <52555765-fb66-de85-766b-3f74f57354ff@knorrie.org>
Reply-to: Hans van Kranenburg <hans@knorrie.org>, 990642@bugs.debian.org
In-reply-to: <eb85636c-a804-85b0-503c-8c8a02c49add@gmxpro.de>
References: <d4afa918-4200-b8d6-bdd4-b3a909fbbe20@gmxpro.de> <YOMBLQ+iXpCPURR4@eldamar.lan> <6a4f6a3b-4729-39c8-d371-56e4fa62df63@gmxpro.de> <YOn6FbUHPgdtgfwj@eldamar.lan> <d4afa918-4200-b8d6-bdd4-b3a909fbbe20@gmxpro.de> <97dd4d7a-5eaf-e0bd-bc05-686252477c4a@gmxpro.de> <YO6viagJ76xvdblJ@eldamar.lan> <8cc2246e-74d7-fda3-2b89-a925e7b3c7f3@gmxpro.de> <YQMXjm5inBfq5FGB@eldamar.lan> <d4afa918-4200-b8d6-bdd4-b3a909fbbe20@gmxpro.de> <eb85636c-a804-85b0-503c-8c8a02c49add@gmxpro.de> <d4afa918-4200-b8d6-bdd4-b3a909fbbe20@gmxpro.de>

Hi spi, Salvatore,

On 8/5/21 1:58 PM, spi@gmxpro.de wrote:
> 
> In preparation for the bug report for upstream I did some more
> investigation.
> 
> The kernel panic also occurs without bonding interfaces, but takes much
> longer to happen. With a bonding interface it happens within
> seconds. Without bonding interfaces it takes about a minute, with the
> network discovery being re-launched 2 or 3 times. The kernel panic
> is still the same one involving the bnx2 driver.
> 
> In the constellation without a bonding interface the kernel panic only
> occurs if
> - opnsense as a domU is running (this domU binds all bridged interfaces
> as default gateway for all networks)

Just FWIW, I'm seeing this bug-mail-thread now, and it rings a bell.

I spent some time in the past debugging crashing BCM5719 (4x1G) NICs in
HP DL360 G8/9 series servers. In that case, the firmware inside the NIC
crashed, so the symptoms were different. This happened only when a Xen
domU was active as a router, routing incoming traffic (from outside the
box) back to the outside again.

02:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
02:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
02:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
02:00.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

Also, 2x 1G ports were bonded; I use openvswitch with LACP for that.

The symptoms are obviously different; mine looked like this:

tg3 0000:02:00.2 eth1: transmit timed out, resetting
tg3 0000:02:00.2 eth1: 0x00000000: 0x165714e4, 0x00100546, 0x02000001, 0x00800010
tg3 0000:02:00.2 eth1: 0x00000010: 0x92b3000c, 0x00000000, 0x92b4000c, 0x00000000
tg3 0000:02:00.2 eth1: 0x00000020: 0x92b5000c, 0x00000000, 0x00000000, 0x22be103c

tg3 0000:02:00.2 eth1: 0x00007000: 0x08000008, 0x00000000, 0x00000000, 0x00004cd8
tg3 0000:02:00.2 eth1: 0x00007010: 0xdbbd2b97, 0x010080f3, 0x00d70081, 0x03008200
tg3 0000:02:00.2 eth1: 0x00007020: 0x00000000, 0x00000000, 0x00000406, 0x10004000
tg3 0000:02:00.2 eth1: 0x00007030: 0x00020000, 0x00004cdc, 0x001f0000, 0x00000000
tg3 0000:02:00.2 eth1: 0: Host status block [00000001:00000070:(0000:0563:0000):(0000:0094)]
tg3 0000:02:00.2 eth1: 0: NAPI info [00000070:00000070:(016a:0094:01ff):0000:(068c:0000:0000:0000)]
tg3 0000:02:00.2 eth1: 1: Host status block [00000001:00000083:(0000:0000:0000):(015b:0000)]
tg3 0000:02:00.2 eth1: 1: NAPI info [00000051:00000051:(0000:0000:01ff):0124:(0124:0124:0000:0000)]
tg3 0000:02:00.2 eth1: 2: Host status block [00000001:000000d8:(0e96:0000:0000):(0000:0000)]
tg3 0000:02:00.2 eth1: 2: NAPI info [000000a4:000000a4:(0000:0000:01ff):0e5b:(065b:065b:0000:0000)]
tg3 0000:02:00.2 eth1: 3: Host status block [00000001:00000013:(0000:0000:0000):(0000:0000)]
tg3 0000:02:00.2 eth1: 3: NAPI info [000000f8:000000f8:(0000:0000:01ff):072f:(072f:072f:0000:0000)]
tg3 0000:02:00.2 eth1: 4: Host status block [00000001:0000009c:(0000:0000:0736):(0000:0000)]
tg3 0000:02:00.2 eth1: 4: NAPI info [0000007c:0000007c:(0000:0000:01ff):0716:(0716:0716:0000:0000)]
tg3 0000:02:00.2: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:02:00.2: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:02:00.2 eth1: Link is down
tg3 0000:02:00.2 eth1: Link is up at 1000 Mbps, full duplex
tg3 0000:02:00.2 eth1: Flow control is off for TX and off for RX
tg3 0000:02:00.2 eth1: EEE is disabled

> - sysctl parameter net.bridge.bridge-nf-call-ip6tables is set to 0.
> 
> If both conditions are not met, no kernel panic occurs.

What I found out in the end is that `ethtool -K $iface tso off` works
around it: with TSO (TCP segmentation offload) disabled, the obscure bug
inside the NIC that makes it crash is no longer triggered.

So my actual suggestion is this: even though it does not look like the
same thing, it's still Broadcom stuff, which can have *cough* weird
issues... If you can reliably reproduce the problem, can you try setting
tso off on the physical interfaces in dom0 and try again? In Dutch we
say "nooit geschoten altijd mis" (roughly: if you never shoot, you
always miss).
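
For what it's worth, a minimal sketch of that test could look like the
following. The interface names eth0 and eth1 and the DRY_RUN switch are
assumptions for illustration, not taken from the report; replace the
names with the physical bnx2 ports in dom0.

```shell
#!/bin/sh
# Sketch: turn TSO off on each interface passed as an argument.
# DRY_RUN defaults to 1, so the commands are only printed; set DRY_RUN=0
# (as root, in dom0) to actually run ethtool.
tso_off() {
    for iface in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ethtool -K $iface tso off"
        else
            ethtool -K "$iface" tso off
            # Show the resulting setting so the change can be confirmed.
            ethtool -k "$iface" | grep tcp-segmentation-offload
        fi
    done
}

# eth0 and eth1 are placeholder names for the physical ports.
tso_off eth0 eth1
```

The setting does not survive a reboot, so it has to be reapplied before
each reproduction attempt.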

> Other IPv6 related sysctl parameters are set on dom0 like
> net.ipv6.conf.all.disable_ipv6 = 1
> net.ipv6.conf.default.disable_ipv6 = 1
> net.ipv6.conf.lo.disable_ipv6 = 1
>
> The layer2-iptables settings are
> net.bridge.bridge-nf-call-ip6tables = 0 ***
> net.bridge.bridge-nf-call-iptables = 1
> net.bridge.bridge-nf-call-arptables = 0
>
> As said, if I don't set the one marked with *** to 0 there is no kernel
> panic.
> 
> I wonder whether this really is a kernel issue, but either way I
> wouldn't expect a kernel panic to happen.
> 
> Cheers,
> spi
> 

Have fun,
Hans
