
Bug#858125: e1000: ethernet interface hangs occasionally, kernel reports hang



On Wed, Mar 22, 2017 at 02:42:30AM +0000, Ben Hutchings wrote:
> Control: retitle -1 TX watchdog fires on e1000e interface with flow control enabled
> 
> On Tue, 2017-03-21 at 18:36 -0400, Bruce Momjian,,, wrote:
> > On Tue, Mar 21, 2017 at 04:04:11PM -0400, Bruce Momjian,,, wrote:
> > > I think this proves my problems are related to flow control.  How would
> > > you like to proceed?  Is there a patch or change you would like me to
> > > test?  Just close the ticket?
> > > 
> > > I have a fix, but it is likely others would not know they had this
> > > problem unless they were monitoring their kernel logs or their network
> > > traffic for lag.
> > 
> > Oh, I should also mention the port that is having problems is connected
> > to a NetGear GS108Ev3 switch, with current firmware, version 2.00.09. 
> > The port connected to my Actiontec FIOS router is not having problems.
> 
> I don't know about any specific bug, but if the switch sends flow
> control XOFF frames continually for long enough (usually 5 seconds)
> this will trigger the TX watchdog.

Makes sense.

> It sounds like your switch implements flow control properly (some
> broken switches auto-negotiate it but actually flood flow control
> frames).  However, if a device on some other port (that also has flow

If I turn off flow control on the switch port, and leave the Debian
server at defaults, the Debian port automatically turns off flow
control, which must be what 'autoneg' is meant to do.
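
A quick way to confirm what each side ended up with is to look at the
pause parameters that "ethtool -a" reports.  The helper below is only a
sketch: the name pause_summary is made up for illustration, and it
assumes the usual "Autonegotiate:", "RX:" and "TX:" lines in the
ethtool output.

```shell
# pause_summary: condense "ethtool -a"-style output into one line.
# Hypothetical helper; assumes "Autonegotiate:", "RX:" and "TX:" lines.
pause_summary() {
    awk -F':[ \t]*' '
        $1 == "Autonegotiate" { autoneg = $2 }
        $1 == "RX"            { rx = $2 }
        $1 == "TX"            { tx = $2 }
        END { printf "autoneg=%s rx=%s tx=%s\n", autoneg, rx, tx }
    ' "$@"
}

# Usage on the affected machine (eth0 as in this thread):
#   ethtool -a eth0 | pause_summary
```

Running it before and after changing the switch port setting should
show whether the negotiated RX/TX pause state actually changed.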

> control enabled) sends XOFF frames continually *and* your server sends
> frames that should go to that other port, the switch will do the same
> to the server once the switch's internal queue has filled up.

What I could do is turn off flow control on all switch ports _except_
the one for the Debian server.  The switch supports per-port flow
control settings.

> If the switch has port statistics including numbers of pause frames
> then you can see where they are coming from, but I think it doesn't.
> Without that information it's going to be hard to tell exactly where
> the fault lies.

Yeah, I don't see flow control stats on the switch, just CRC error
reporting.

> The e1000e driver *does* have statistics for pause frames transmitted
> and received (run: "ethtool -S eth0| grep flow_control").  If you log
> these every second then it should be possible to see what happens
> around the time the TX watchdog fires.  That could provide some clues
> as to whether the NIC is behaving correctly.

OK, I am running this after setting flow control on/default on the
switch and Debian, and rebooting:

	daemon -- sh -c "while :; do date;ethtool -S eth0| grep flow_control;
	sleep 1;done > /root/ethtool"

I will report back with the relevant logging lines once it hangs again. 
Thanks.
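
To make that log easier to read, something like the following could
reduce it to the samples where a pause-frame counter actually moved.
It is a rough sketch, not something tested against a real hang: the
function name pause_deltas is made up, and it assumes the log has the
shape produced by the loop above (a "date" line followed by
"<counter>: <value>" lines containing "flow_control").

```shell
# pause_deltas: print only the samples where a flow-control counter changed.
# Hypothetical helper; assumes a log of timestamp lines, each followed by
# "<counter>: <value>" lines whose names contain "flow_control".
pause_deltas() {
    awk '
        /flow_control/ {
            name = $1
            sub(/:$/, "", name)          # strip the trailing colon
            value = $2 + 0               # force a numeric comparison
            if ((name in prev) && value != prev[name])
                printf "%s %s +%d\n", stamp, name, value - prev[name]
            prev[name] = value
            next
        }
        { stamp = $0 }                   # any other line is the timestamp
    ' "$@"
}

# Usage with the log file written above:
#   pause_deltas /root/ethtool
```

Each output line gives the timestamp, the counter name, and how much
it grew since the previous sample, which can then be lined up against
the time of the watchdog message in the kernel log.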

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

