Bug#721316: base: NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out
Hi,
We're seeing what appears to be the same problem on a Pacemaker cluster
of ours; it is causing us serious issues because the nodes are rebooted
when the problem appears.
Has any progress been made in identifying the cause of this and/or
fixing the problem?
From dmesg:
> Dec 28 23:16:32 tyne kernel: [418756.268195] WARNING: at /build/linux-rrsxby/linux-3.2.51/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> Dec 28 23:16:32 tyne kernel: [418756.382761] Hardware name: X9DRD-iF
> Dec 28 23:16:32 tyne kernel: [418756.496392] NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
> Dec 28 23:16:33 tyne kernel: [418756.607364] Modules linked in: hmac dlm sctp libcrc32c configfs ip6table_filter ebtable_nat ebtables act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit xt_realm xt_addrtype iptable_raw xt_comment
> xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_con
> ntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_p
> hysdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_
> Dec 28 23:16:34 tyne kernel: mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_
> core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables x_tables nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding sha1_ssse3 sha1_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache bridge stp loop kvm_intel kvm snd_pcm s
> nd_timer coretemp snd soundcore acpi_cpufreq crc32c_intel ghash_clmulni_intel mperf aesni_intel psmouse snd_page_alloc cryptd iTCO_wdt sb_edac processor i2c_i801 serio_raw aes_x86_64 ioatdma pcspkr iTCO_vendor_support aes_generic thermal_sys i2c_core joydev edac_core evdev container button acpi_pad ext4 crc16 jbd2 m
> bcache dm_mod raid1 md_mod microcode usbhid hid sg sd_mod crc_t10dif ahci lib
> Dec 28 23:16:34 tyne kernel: ahci isci libsas libata ehci_hcd scsi_transport_sas usbcore igb scsi_mod usb_common dca [last unloaded: scsi_wait_scan]
> Dec 28 23:16:34 tyne kernel: [418758.541550] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.51-1
> Dec 28 23:16:34 tyne kernel: [418758.652098] Call Trace:
> Dec 28 23:16:35 tyne kernel: [418758.761884] <IRQ> [<ffffffff81046cbd>] ? warn_slowpath_common+0x78/0x8c
> Dec 28 23:16:35 tyne kernel: [418758.869948] [<ffffffff81046d69>] ? warn_slowpath_fmt+0x45/0x4a
> Dec 28 23:16:35 tyne kernel: [418758.977593] [<ffffffff812a6f11>] ? netif_tx_lock+0x40/0x75
> Dec 28 23:16:35 tyne kernel: [418759.082681] [<ffffffff812a7081>] ? dev_watchdog+0xf2/0x151
> Dec 28 23:16:35 tyne kernel: [418759.186240] [<ffffffff81052480>] ? run_timer_softirq+0x19a/0x261
> Dec 28 23:16:35 tyne kernel: [418759.287841] [<ffffffff812a6f8f>] ? netif_tx_unlock+0x49/0x49
> Dec 28 23:16:35 tyne kernel: [418759.387569] [<ffffffff8104c2f8>] ? __do_softirq+0xb9/0x177
> Dec 28 23:16:35 tyne kernel: [418759.486351] [<ffffffff81096529>] ? rcu_needs_cpu+0x50/0x1bb
> Dec 28 23:16:35 tyne kernel: [418759.583008] [<ffffffff8135646c>] ? call_softirq+0x1c/0x30
> Dec 28 23:16:35 tyne kernel: [418759.677333] [<ffffffff8100f8cd>] ? do_softirq+0x3c/0x7b
> Dec 28 23:16:36 tyne kernel: [418759.770142] [<ffffffff8104c560>] ? irq_exit+0x3c/0x99
> Dec 28 23:16:36 tyne kernel: [418759.860906] [<ffffffff8100f5fd>] ? do_IRQ+0x82/0x98
> Dec 28 23:16:36 tyne kernel: [418759.954639] [<ffffffff8134f4ee>] ? common_interrupt+0x6e/0x6e
> Dec 28 23:16:36 tyne kernel: [418760.048124] <EOI> [<ffffffff811ee07d>] ? intel_idle+0xea/0x119
> Dec 28 23:16:36 tyne kernel: [418760.137012] [<ffffffff811ee05c>] ? intel_idle+0xc9/0x119
> Dec 28 23:16:36 tyne kernel: [418760.222705] [<ffffffff8126febd>] ? cpuidle_idle_call+0xec/0x179
> Dec 28 23:16:36 tyne kernel: [418760.306317] [<ffffffff8100d243>] ? cpu_idle+0xa5/0xf2
> Dec 28 23:16:36 tyne kernel: [418760.388391] [<ffffffff816abb36>] ? start_kernel+0x3b8/0x3c3
> Dec 28 23:16:36 tyne kernel: [418760.470137] [<ffffffff816ab140>] ? early_idt_handlers+0x140/0x140
> Dec 28 23:16:36 tyne kernel: [418760.548953] [<ffffffff816ab3c4>] ? x86_64_start_kernel+0x104/0x111
> Dec 28 23:16:36 tyne kernel: [418760.626209] ---[ end trace 25448d4e9ff0e259 ]---
> Dec 28 23:16:37 tyne kernel: [418760.710249] igb 0000:06:00.1: eth1: Reset adapter
> Dec 28 23:16:37 tyne kernel: [418760.814181] igb 0000:06:00.0: eth0: Reset adapter
- and -
> Dec 28 23:16:32 tees kernel: [419013.476706] WARNING: at /build/linux-rrsxby/linux-3.2.51/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> Dec 28 23:16:33 tees kernel: [419013.591003] Hardware name: X9DRD-iF
> Dec 28 23:16:33 tees kernel: [419013.705052] NETDEV WATCHDOG: eth1 (igb): transmit queue 3 timed out
> Dec 28 23:16:34 tees kernel: [419013.817376] Modules linked in: hmac dlm sctp libcrc32c configfs ip6table_filter ebtable_nat ebtables act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit xt_realm xt_addrtype iptable_raw xt_comment xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_
> Dec 28 23:16:34 tees kernel: mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables x_tables nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding sha1_ssse3 sha1_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache bridge stp loop kvm_intel kvm snd_pcm snd_timer snd i2c_i801 coretemp crc32c_intel iTCO_wdt soundcore ghash_clmulni_intel acpi_cpufreq
(this is as far as that server got before being STONITHed)
Both servers have Supermicro X9DRD-iF motherboards and are running
linux-image-3.2.0-4-amd64 3.2.51-1.
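In case it is useful to anyone comparing their own logs against the two
traces above, here is a minimal sketch for pulling the watchdog timeouts
out of kernel log lines. It assumes only the message format shown above
("NETDEV WATCHDOG: <iface> (<driver>): transmit queue <n> timed out");
the function name and the regular expression are my own illustration and
not part of any existing tool.

```python
import re

# Pattern for the watchdog message as it appears in the traces above, e.g.
# "NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out".
# This is an assumption based on those lines only; other kernel versions
# may word the message differently.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return a list of (iface, driver, queue) tuples, one per timeout line.

    `lines` is any iterable of log lines (hypothetical helper, for
    illustration only). Lines that do not contain the message are skipped.
    """
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            events.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return events
```

Calling find_watchdog_timeouts() on an open log file (or on the output of
dmesg split into lines) gives one tuple per timeout, which makes it easier
to see whether the same interface and queue are affected on each node.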
lspci -vvv for one of the ports in question (eth1 on tyne) is:
> 06:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> Subsystem: Super Micro Computer Inc Device 1521
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin B routed to IRQ 17
> Region 0: Memory at fbd00000 (32-bit, non-prefetchable) [size=128K]
> Region 2: I/O ports at d000 [size=32]
> Region 3: Memory at fbdc0000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> Address: 0000000000000000 Data: 0000
> Masking: 00000000 Pending: 00000000
> Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
> Vector table: BAR=3 offset=00000000
> PBA: BAR=3 offset=00002000
> Capabilities: [a0] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
> LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Latency L0 <4us, L1 <32us
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-4e-ae-18
> Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 0
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
> IOVCap: Migration-, Interrupt Message Number: 000
> IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
> IOVSta: Migration-
> Initial VFs: 8, Total VFs: 8, Number of VFs: 8, Function Dependency Link: 01
> VF offset: 384, stride: 4, Device ID: 1520
> Supported Page Size: 00000553, System Page Size: 00000001
> Region 0: Memory at fbd60000 (32-bit, non-prefetchable)
> Region 3: Memory at fbd40000 (32-bit, non-prefetchable)
> VF Migration: offset: 00000000, BIR: 0
> Capabilities: [1a0 v1] Transaction Processing Hints
> Device specific mode supported
> Steering table in TPH capability structure
> Capabilities: [1d0 v1] Access Control Services
> ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
> ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
> Kernel driver in use: igb
Please let me know if I can provide any further information.
Best regards,
Chris
--
Chris Boot
Tiger Computing Ltd
"Linux for Business"
Tel: 01600 483 484
Web: http://www.tiger-computing.co.uk
Follow us on Facebook: http://www.facebook.com/TigerComputing
Registered in England. Company number: 3389961
Registered address: Wyastone Business Park,
Wyastone Leys, Monmouth, NP25 3SR