Re: Broadcom TG3 network drops, cannot recover without reboot
[Sorry, Henrique, for replying directly to you]
> On 26 May 2015, at 15:39, Henrique de Moraes Holschuh wrote:
> 
> On Tue, May 26, 2015, at 09:24, Justin Catterall wrote:
>> At irregular times, and apparently for no reason at all, networking
>> drops and cannot be restarted without reboot on a fresh install of
>> Jessie. The NIC is a Broadcom NetXtreme BCM5720.
>> 
>> ifconfig thinks networking is still up because I can:
>> 	ifconfig eth0 down
>> 
>> I find this when I try 'ifconfig eth0 up':
>> tg3_abort_hw timed out TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
> 
> Hmm, it is either a kernel issue, or a hardware issue.
> 
>> Any suggestions on where to look for a solution?
> 
> Yes.
> 
> First, disable all hardware offloading using ethtool.  See if that
> helps.
I was able to disable all except:
  rx-vlan-offload: on [fixed]
  tx-vlan-offload: on [fixed]
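
For anyone following along, here is a minimal sketch of switching off the usual offloads with ethtool. It is not the exact command used above. The interface name eth0 comes from the original report; the feature list (rx, tx, sg, tso, gso, gro) is an assumption based on the common short names accepted by "ethtool -K", and DRY_RUN is a hypothetical safety switch so that by default the command is only printed.

```shell
#!/bin/sh
# Sketch: turn off the common offload features on one interface.
# IFACE and FEATURES are assumptions; compare with what
# `ethtool -k "$IFACE"` reports as changeable on your system.
IFACE="${IFACE:-eth0}"
FEATURES="rx tx sg tso gso gro"

# Build the command line once so it can be printed or executed.
offload_cmd() {
    cmd="ethtool -K $IFACE"
    for f in $FEATURES; do
        cmd="$cmd $f off"
    done
    echo "$cmd"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    offload_cmd            # only show what would be run
else
    $(offload_cmd)         # needs root
    ethtool -k "$IFACE"    # show the result; [fixed] entries cannot be changed
fi
```

Features that ethtool marks as [fixed], such as the two VLAN offloads listed above, stay on whatever is requested.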
Now, if I run "/etc/init.d/networking restart", the system doesn't report any error, but networking is still dead. However, I can rmmod tg3, ptp and libphy, then run "modprobe tg3" and "/etc/init.d/networking start", and all works. I have done this a handful of times with no need to reboot to re-enable networking. So that's some progress.
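
The recovery sequence described above could be sketched roughly as follows. RUN is a hypothetical hook that defaults to echo, so the script only prints the commands unless it is started as root with RUN set to an empty string. Removing tg3 before ptp and libphy assumes tg3 is the module holding references to the other two, which is what the order in the text suggests.

```shell
#!/bin/sh
# Sketch of the manual recovery sequence: unload the driver stack,
# load it again, then bring networking back up.
# RUN defaults to echo (dry run); use RUN="" as root to really execute.
RUN="${RUN-echo}"

reload_tg3() {
    # tg3 goes first; ptp and libphy are assumed to be in use by it.
    for mod in tg3 ptp libphy; do
        $RUN rmmod "$mod" || echo "warning: could not remove $mod" >&2
    done
    # modprobe resolves module dependencies of tg3 on its own.
    $RUN modprobe tg3 || return 1
    $RUN /etc/init.d/networking start
}

reload_tg3
```

If another driver is also using ptp or libphy, the corresponding rmmod will fail with a warning and the rest of the sequence still runs.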
> Also, if this NIC is in the system mainboard, make sure you are using
> the latest firmware ("BIOS update") from your motherboard vendor: it is
> usual to have the motherboard NICs use a data block in the shared system
> FLASH for vital product data and firmware. The motherboard vendor will
> bundle up updates for the NIC firmware with the BIOS updates when both
> are in the same FLASH chip.
I've read the documentation for the latest firmware and there is no mention of changes for the NIC. It lists only a "power-on delay option", to allow a longer or shorter period in which to hit the key that accesses the BIOS, and a change to boot device detection to better detect devices with invalid boot records. No other changes are mentioned.
Here's a link to the page:
http://h20565.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=5390291&swItemId=MTX_a21cee44c55643598fb2f52bc2&swEnvOid=4144#tab4
I don't like tinkering with firmware if I can help it. In this case they don't say there are changes to the NIC, so do you think I should still upgrade? The description says no bugs fixed, only enhancements.
> Make sure you have the latest linux firmware file for the tg3 driver as
> well.  If the initramfs image has the tg3.ko module inside, it must also
> have the firmware file.  A workaround for any initramfs-related tg3
> firmware loading issues is to "rmmod tg3 ; modprobe tg3"  after the
> system booted (and before the NIC hardlocks).
See above: even after rmmod'ing, I can still force a network restart to fail without error, though it is recoverable if noticed.
> If all of the above failed, get yourself familiar with building a custom
> Debian-compatible kernel using pristine upstream kernels from
> www.kernel.org.  Wait until 3.18.15 and 4.0.5 are released in
> www.kernel.org, and build custom kernels based on them.  Alternatively,
> wait until a debian-packaged version of kernel 4.0.5 is available.  DO
> NOT use 4.0 kernels before 4.0.5 on pain of possible data loss.
Data loss? On a "stable" kernel? WTF are they doing these days? I notice that stable/dev are no longer even/odd minor numbers; it took me a bit of Googling to get caught up!
> If either the 3.18.15 or 4.0.5 kernel fixes the issue with your bcm5720,
> please tell us so that we can try to isolate the fix and backport it to
> the Debian kernel.
In the meantime I've made a bash script to rmmod and modprobe as appropriate. I'll set up a cron job to ping a couple of other servers on the LAN and, should the pings fail, execute the script and restart networking.
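
A watchdog along those lines might look like the sketch below. It is not the actual script. PEERS, RECOVER and the file paths are hypothetical placeholders (192.0.2.x is a documentation-only address range), and the ping options assume the Linux iputils ping, where -c is the packet count and -W the reply timeout in seconds.

```shell
#!/bin/sh
# Watchdog sketch for cron: if none of the peers answers a ping,
# call a recovery script. PEERS and RECOVER are placeholders.
PEERS="${PEERS:-192.0.2.10 192.0.2.11}"
RECOVER="${RECOVER:-/usr/local/sbin/reload-tg3.sh}"
PING="${PING:-ping -c 2 -W 2}"

# Succeeds as soon as one peer replies.
any_peer_alive() {
    for host in $PEERS; do
        $PING "$host" >/dev/null 2>&1 && return 0
    done
    return 1
}

check_network() {
    if any_peer_alive; then
        echo "network ok"
    else
        echo "no peer reachable, running $RECOVER"
        "$RECOVER"
    fi
}

# Run the check only when asked to, e.g. from a crontab entry such as:
#   */5 * * * * /usr/local/sbin/net-watchdog.sh run
if [ "${1:-}" = "run" ]; then
    check_network
fi
```

Pinging more than one peer avoids reloading the driver just because a single remote server happens to be down.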
> If that fails, you will have to engage the kernel community itself for a
> fix.  Please file a bug on bugzilla.kernel.org, and good luck. There are
> several hardware hang reports open against BCM57xx + tg3.
Damn crap hardware. I remember having issues with tg3 at least six or seven years ago. I can't believe it's still being incorporated into motherboards when there are obviously problems with the chipset. Depending on the speed of progress on the kernel front I may just stick a PCI NIC in there - I think I still have some 3c509s around somewhere...
> Alternatively, try to get yourself an Intel NIC that works with the igb
> driver (don't get an Intel NIC that needs the e1000e driver) to replace
> the hardlock-prone bcm5720 + tg3 combination.
Thanks for the pointers. I at least have a situation now where I don't need a reboot to get networking functioning after it fails. It's far from perfect, but it's much, much better.
-- 
Justin C, by the sea.