Bug#777683: e1000e driver, empty TX queue after IP drop causes dev_watchdog

To: 777683@bugs.debian.org
Subject: Bug#777683: e1000e driver, empty TX queue after IP drop causes dev_watchdog
From: S Egbert <s.egbert@sbcglobal.net>
Date: Tue, 24 May 2016 21:06:58 -0400
Message-id: <5744FAB2.4060707@sbcglobal.net>
Reply-to: S Egbert <s.egbert@sbcglobal.net>, 777683@bugs.debian.org

I have the same problem on Debian, as three other reporters do.

As a former Ethernet driver developer, I noticed that the TX queue was empty when the interrupt fired, and that it appeared to hang in the Linux qdisc code in interrupt context, to the point that the watchdog timer expired.

My relevant details are:
    Dell OptiPlex 980
    3.16.0-4-amd64
    linux/3.16.7-ckt25-2 (2016-04-08) x86_64
    Intel Gigabit Ethernet 82578DM Gigabit Network Connection (rev 05)

Here is what I've gathered from the potentially duplicate bug #798512 and the Intel Community Forums:

1. It isn't CPU-related.
2. This error happened in the following Linux kernel versions:
    a. 3.16.0-4-amd64
    b. 3.19.5 (source: Intel communities)
    c. 4.3+70~bpo8+1
    d. 3.16.7-ckt11-1
3. This error does NOT happen in the following Linux kernel versions (take this with a grain of salt, since we don't yet have a reliable way to reproduce the bug):
    a. 3.16.7-ckt20-1+deb8u4
4. Intel-supplied driver versions that still show the error:
    a. 3.3.3-NAPI
5. Intel hardware having this problem:
    a. Intel I217-V (rev 04) (onboard) (has lspci SERR-)
    b. Intel 82578DM (rev 05) (onboard) (has lspci SERR+)
    c. Intel Corporation 82579V Gigabit Network Connection (rev 05) (onboard)
6. Linux network interface state (ip link output):
    a. eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    b. eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UP mode DEFAULT group default qlen 1000

So far, the common threads among the similar reports are the following (more reports will eliminate a few):
1. e1000e driver
2. ip link shows the 'pfifo_fast' qdisc
3. Onboard Ethernet (PCI-related?)
4. Starting at Linux 3.16.0
5. The count of dropped outgoing IP packets was non-zero (mostly 32 packets)
6. A similar call stack backtrace:

Bug #777683 call stack backtrace

[  295.041406]  <IRQ>  [<ffffffff8150b405>] ? dump_stack+0x41/0x51
[  295.041417]  [<ffffffff81067797>] ? warn_slowpath_common+0x77/0x90
[  295.041420]  [<ffffffff810677fc>] ? warn_slowpath_fmt+0x4c/0x50
[  295.041425]  [<ffffffff81074777>] ? mod_timer+0x127/0x1e0
[  295.041430]  [<ffffffff8143eb96>] ? dev_watchdog+0x236/0x240
[  295.041433]  [<ffffffff8143e960>] ? dev_graft_qdisc+0x70/0x70
[  295.041436]  [<ffffffff81072ae1>] ? call_timer_fn+0x31/0x100
[  295.041439]  [<ffffffff8143e960>] ? dev_graft_qdisc+0x70/0x70
[  295.041442]  [<ffffffff81074119>] ? run_timer_softirq+0x209/0x2f0
[  295.041445]  [<ffffffff8106c641>] ? __do_softirq+0xf1/0x290
[  295.041448]  [<ffffffff8106ca15>] ? irq_exit+0x95/0xa0
[  295.041451]  [<ffffffff81514455>] ? smp_apic_timer_interrupt+0x45/0x60
[  295.041455]  [<ffffffff8151253d>] ? apic_timer_interrupt+0x6d/0x80
[  295.041456]  <EOI>  [<ffffffff81074a26>] ? get_next_timer_interrupt+0x1d6/0x250
[  295.041465]  [<ffffffff813ddf9f>] ? cpuidle_enter_state+0x4f/0xc0
[  295.041468]  [<ffffffff813ddf98>] ? cpuidle_enter_state+0x48/0xc0
[  295.041472]  [<ffffffff810a7fa8>] ? cpu_startup_entry+0x2f8/0x400
[  295.041475]  [<ffffffff81903071>] ? start_kernel+0x492/0x49d
[  295.041478]  [<ffffffff81902a04>] ? set_init_arg+0x4e/0x4e
[  295.041480]  [<ffffffff81902120>] ? early_idt_handlers+0x120/0x120
[  295.041483]  [<ffffffff8190271f>] ? x86_64_start_kernel+0x14d/0x15c
[  295.041485] ---[ end trace aaf46f7eeccba58f ]---
[  295.041502] e1000e 0000:00:19.0 eth-office: Reset adapter unexpectedly

Intel Community Forums (Intel 3.3.3-NAPI driver):
(source: https://communities.intel.com/message/305442#305442)
<IRQ>
[<ffffffff812e1ac9>] ? dump_stack+0x40/0x57
[<ffffffff81074451>] ? warn_slowpath_common+0x81/0xb0
[<ffffffff810744dc>] ? warn_slowpath_fmt+0x5c/0x80
[<ffffffff814b89e9>] ? dev_watchdog+0x229/0x240
[<ffffffff814b87c0>] ? dev_deactivate_queue.constprop.34+0x60/0x60
[<ffffffff810d6e90>] ? call_timer_fn+0x30/0xf0
[<ffffffff814b87c0>] ? dev_deactivate_queue.constprop.34+0x60/0x60
[<ffffffff810d861d>] ? run_timer_softirq+0x17d/0x2b0
[<ffffffff81078ca7>] ? __do_softirq+0x107/0x270
[<ffffffff81078f46>] ? irq_exit+0x86/0x90
[<ffffffff8158d90e>] ? smp_apic_timer_interrupt+0x3e/0x50
[<ffffffff8158b7a2>] ? apic_timer_interrupt+0x82/0x90
<EOI>
[<ffffffff8145ce08>] ? cpuidle_enter_state+0xe8/0x220
[<ffffffff8145cde3>] ? cpuidle_enter_state+0xc3/0x220
[<ffffffff810b3894>] ? cpu_startup_entry+0x294/0x350
[<ffffffff8104b600>] ? start_secondary+0x150/0x190


Debian Bug #798512

[<ffffffff81067797>] ? warn_slowpath_common+0x77/0x90
[<ffffffff810677fc>] ? warn_slowpath_fmt+0x4c/0x50
[<ffffffff81074777>] ? mod_timer+0x127/0x1e0
[<ffffffff8143eb96>] ? dev_watchdog+0x236/0x240
[<ffffffff8143e960>] ? dev_graft_qdisc+0x70/0x70
[<ffffffff81072ae1>] ? call_timer_fn+0x31/0x100
[<ffffffff8143e960>] ? dev_graft_qdisc+0x70/0x70
[<ffffffff81074119>] ? run_timer_softirq+0x209/0x2f0
[<ffffffff8106c641>] ? __do_softirq+0xf1/0x290
[<ffffffff8106ca15>] ? irq_exit+0x95/0xa0

My /var/log/messages (3.6.14):
dmesg: e1000e: Intel(R) PRO/1000 Network Driver - 2.3.2-k
dmesg: e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
dmesg: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
May 24 18:44:55 sandbay kernel: [ 840.766377] <IRQ> [<ffffffff8150e835>] ? dump_stack+0x5d/0x78
May 24 18:44:55 sandbay kernel: [ 840.766391] [<ffffffff810677f7>] ? warn_slowpath_common+0x77/0x90
May 24 18:44:55 sandbay kernel: [ 840.766396] [<ffffffff8106785c>] ? warn_slowpath_fmt+0x4c/0x50
May 24 18:44:55 sandbay kernel: [ 840.766410] [<ffffffff81440f86>] ? dev_watchdog+0x236/0x240
May 24 18:44:55 sandbay kernel: [ 840.766418] [<ffffffff81440d50>] ? dev_graft_qdisc+0x70/0x70
May 24 18:44:55 sandbay kernel: [ 840.766424] [<ffffffff81072ba1>] ? call_timer_fn+0x31/0x100
May 24 18:44:55 sandbay kernel: [ 840.766435] [<ffffffff81440d50>] ? dev_graft_qdisc+0x70/0x70
May 24 18:44:55 sandbay kernel: [ 840.766439] [<ffffffff810741d9>] ? run_timer_softirq+0x209/0x2f0
May 24 18:44:55 sandbay kernel: [ 840.766444] [<ffffffff8106c6a1>] ? __do_softirq+0xf1/0x290
May 24 18:44:55 sandbay kernel: [ 840.766452] [<ffffffff8106ca75>] ? irq_exit+0x95/0xa0
May 24 18:44:55 sandbay kernel: [ 840.766457] [<ffffffff81517822>] ? do_IRQ+0x52/0xe0
May 24 18:44:55 sandbay kernel: [ 840.766465] [<ffffffff8151566d>] ? common_interrupt+0x6d/0x6d
May 24 18:44:55 sandbay kernel: [ 840.766467] <EOI> [<ffffffff813e011f>] ? cpuidle_enter_state+0x4f/0xc0
May 24 18:44:55 sandbay kernel: [ 840.766475] [<ffffffff813e0118>] ? cpuidle_enter_state+0x48/0xc0
May 24 18:44:55 sandbay kernel: [ 840.766483] [<ffffffff810a8398>] ? cpu_startup_entry+0x2f8/0x400
May 24 18:44:55 sandbay kernel: [ 840.766488] [<ffffffff81042cbf>] ? start_secondary+0x20f/0x2d0
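
The overlap between these backtraces can be checked mechanically rather than by eye. Below is a minimal Python sketch, not part of any existing tool: it pulls the function name out of each "? name+0xoff/0xsize" frame and lists the names shared by every trace. The helper names (frames, common_frames) and the two short sample excerpts are my own choices for illustration; the excerpts are taken from the Bug #777683 and Intel forum traces above.

```python
import re

# Matches the "? function+0xoffset/0xsize" part of a kernel backtrace line.
# Symbol names may contain dots (e.g. compiler-generated ".constprop.34").
FRAME_RE = re.compile(r"\?\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(trace_text):
    """Return the function names found in a backtrace, in order of appearance."""
    names = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


def common_frames(*traces):
    """Return names present in every trace, ordered as in the first trace."""
    other_sets = [set(frames(t)) for t in traces[1:]]
    seen = set()
    shared = []
    for name in frames(traces[0]):
        if name not in seen and all(name in s for s in other_sets):
            seen.add(name)
            shared.append(name)
    return shared


# Short excerpts copied from the traces in this report (illustration only).
TRACE_A = """
[  295.041417]  [<ffffffff81067797>] ? warn_slowpath_common+0x77/0x90
[  295.041425]  [<ffffffff81074777>] ? mod_timer+0x127/0x1e0
[  295.041430]  [<ffffffff8143eb96>] ? dev_watchdog+0x236/0x240
[  295.041433]  [<ffffffff8143e960>] ? dev_graft_qdisc+0x70/0x70
[  295.041436]  [<ffffffff81072ae1>] ? call_timer_fn+0x31/0x100
"""

TRACE_B = """
[<ffffffff81074451>] ? warn_slowpath_common+0x81/0xb0
[<ffffffff814b89e9>] ? dev_watchdog+0x229/0x240
[<ffffffff814b87c0>] ? dev_deactivate_queue.constprop.34+0x60/0x60
[<ffffffff810d6e90>] ? call_timer_fn+0x30/0xf0
"""

if __name__ == "__main__":
    for name in common_frames(TRACE_A, TRACE_B):
        print(name)
```

Only names are compared, since addresses and offsets differ between kernel builds. Frames marked with "?" are unreliable stack entries, so a shared name is a hint and not proof of an identical code path.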

For those who have this same problem, it would help to provide the following (shell command output where applicable):
- uname -a
- lspci -vv
- dmesg | grep e1000 # not 'grep e1000e'; we want to know whether conflicts between Intel Ethernet drivers exist
- ip -s link show # we want to know whether there is one Ethernet netdevice or several
- call stack backtrace (from dmesg or /var/log/messages)
- firmware version
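
To make gathering this less tedious, here is a rough Python sketch that runs the commands above and prints their output under headings. It is only an illustration under some assumptions: the helper names (run, grep, collect) are made up, the interface name "eth0" is a guess that you should change to match your system, and I am assuming "ethtool -i" is how you want to obtain the firmware version (it reports a firmware-version line for the given interface). The backtrace still has to be copied by hand from dmesg or the system log.

```python
import shutil
import subprocess


def run(cmd):
    """Run cmd (a list of arguments) and return its output as text.

    Returns a bracketed note instead of raising if the tool is missing or fails.
    """
    if shutil.which(cmd[0]) is None:
        return f"[{cmd[0]} not installed]"
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[{' '.join(cmd)} failed: {exc}]"
    return proc.stdout + proc.stderr


def grep(text, needle):
    """Keep only the lines of text that contain needle (like a plain grep)."""
    return "\n".join(line for line in text.splitlines() if needle in line)


def collect(iface="eth0"):
    """Gather the diagnostics listed above; iface is an assumed interface name."""
    return {
        "uname -a": run(["uname", "-a"]),
        "lspci -vv": run(["lspci", "-vv"]),
        # Match 'e1000', not 'e1000e', so the older e1000 driver shows up too.
        "dmesg | grep e1000": grep(run(["dmesg"]), "e1000"),
        "ip -s link show": run(["ip", "-s", "link", "show"]),
        f"ethtool -i {iface}": run(["ethtool", "-i", iface]),
    }


if __name__ == "__main__":
    for title, body in collect().items():
        print(f"===== {title} =====")
        print(body.rstrip())
        print()
```

Running it as root is likely needed for the full lspci -vv output, and on some systems for dmesg as well.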
