Re: e1000e driver Network Card Detected Hardware Unit Hang

To: Jamie <darkshad9999@gmail.com>
Cc: debian-user <debian-user@lists.debian.org>
Subject: Re: e1000e driver Network Card Detected Hardware Unit Hang
From: Sirius <sirius@trudheim.com>
Date: Wed, 17 Apr 2024 07:24:09 +0200
Message-id: <Zh9c-dkd4MnANO5R@acer.trudheim.com>
In-reply-to: <da30d37a-760f-442a-a720-ffa14c6f773f@gmail.com>
References: <93363334-7c35-4901-9c98-0f34343002f1@gmail.com> <Zh35I5AVXyR53juA@acer.trudheim.com> <97110b49-1fcd-4dc6-821c-99d9c4c40cb8@gmail.com> <Zh4cjBDkiURM6cY2@photonic.trudheim.com> <da30d37a-760f-442a-a720-ffa14c6f773f@gmail.com>

In days of yore (Tue, 16 Apr 2024), Jamie thus quoth: 
> Look this is a kernel bug and Debian needs to
> fix this! Don't give me any of this crap about upstream
> this is a bug with the Debian Kernel!

Pay attention, because I am now in Support Mode as a former Principal
Technical Account Manager for Red Hat.


No, this is not necessarily a kernel bug. It can be a hardware bug, and it
is plausible that it cannot be solved with a driver workaround.

You are hitting a problem and you want someone else to fix it for you. The
answer may simply be that you need to replace the NIC with something else.

FWIW, I have these Intel NICs in my two NUCs and they are functioning
fine with Debian 12.5 and the latest updates.

$ lspci -v -s 00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection I219-V (rev 21)
	DeviceName:  LAN
	Subsystem: Intel Corporation Ethernet Connection I219-V
	Flags: bus master, fast devsel, latency 0, IRQ 123, IOMMU group 7
	Memory at df100000 (32-bit, non-prefetchable) [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: e1000e
	Kernel modules: e1000e

The revision of the NIC may determine whether you have *hardware* problems
or not.

> This needs to be fixed!

Quick answer: replace the NIC. And do some groundwork first to determine
whether the replacement NIC has known issues you should be aware of.

> I have already tried disabling the offloads and it does
> not work.

The offloads in question seemed to be the checksum-related ones.

# ethtool -k eno1
Features for eno1:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]

Note: when you disable these, throughput can drop sharply.
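
For reference, a minimal sketch of toggling the checksum offloads listed
above, so the change can be applied and reverted as a single step. The
function name, the DRY_RUN variable and the interface name eno1 are
illustrative assumptions, not something from the original report. The
`rx` and `tx` keywords are the short names ethtool accepts for
rx-checksumming and tx-checksumming.

```shell
#!/bin/sh
# Sketch: switch RX/TX checksum offload on or off for one interface.
# Assumption: run as root; DRY_RUN=1 only prints the command.
set_csum_offload() {
    iface=$1
    state=$2   # "on" or "off"
    case $state in
        on|off) ;;
        *) echo "usage: set_csum_offload IFACE on|off" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Show what would be run without touching the NIC
        echo "ethtool -K $iface rx $state tx $state"
    else
        ethtool -K "$iface" rx "$state" tx "$state"
    fi
}
```

Calling `set_csum_offload eno1 off` and later `set_csum_offload eno1 on`
keeps this to one change at a time. The setting does not survive a
reboot unless it is made persistent elsewhere.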

The other setting suggested was to hike the TX ringbuffer.

# ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX:		4096
RX Mini:	n/a
RX Jumbo:	n/a
TX:		4096
Current hardware settings:
RX:		256
RX Mini:	n/a
RX Jumbo:	n/a
TX:		256
RX Buf Len:		n/a
CQE Size:		n/a
TX Push:	off
TCP data split:	n/a

# ethtool -G eno1 tx 2048 rx 2048
# ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX:		4096
RX Mini:	n/a
RX Jumbo:	n/a
TX:		4096
Current hardware settings:
RX:		2048
RX Mini:	n/a
RX Jumbo:	n/a
TX:		2048
RX Buf Len:		n/a
CQE Size:		n/a
TX Push:	off
TCP data split:	n/a

The ringbuffers matter because the kernel can construct packets in bursts
faster than the NIC can send them. The OS queues packets in the
ringbuffer, and the NIC asynchronously picks them from the buffer and
sends them across the wire. If the ringbuffers are set too small, they
will overflow and packets will be dropped.
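
To see at a glance how much headroom is left, the `ethtool -g` output
can be condensed to current/maximum per ring. This is only a sketch: the
function name ring_summary is made up for illustration, and it assumes
the output layout shown above (a "Pre-set maximums:" section followed by
a "Current hardware settings:" section).

```shell
#!/bin/sh
# Sketch: summarise `ethtool -g` output read from stdin as "RING current/max".
ring_summary() {
    awk '
        /^Pre-set maximums:/          { section = "max"; next }
        /^Current hardware settings:/ { section = "cur"; next }
        # Only the plain "RX:" and "TX:" lines; "RX Mini:", "TX Push:" etc.
        # have a first field without the colon and are skipped.
        $1 == "RX:" || $1 == "TX:" {
            name = substr($1, 1, 2)
            if (section == "max") max[name] = $2
            if (section == "cur") cur[name] = $2
        }
        END {
            printf "RX %s/%s\n", cur["RX"], max["RX"]
            printf "TX %s/%s\n", cur["TX"], max["TX"]
        }
    '
}
```

Usage would be `ethtool -g eno1 | ring_summary`, run before and after
changing the ring sizes with `ethtool -G`.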

# ethtool -S eno1
NIC statistics:
     rx_packets: 24463
     tx_packets: 6358
     rx_bytes: 3093199
     tx_bytes: 669733
     rx_broadcast: 8044
     tx_broadcast: 9
     rx_multicast: 10434
     tx_multicast: 2510
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0     <<<< If buffers are set too small, this increases
     multicast: 10434
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 9
     rx_flow_control_xoff: 9
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_csum_offload_good: 8539    <<<< If you have issues with checksum
     rx_csum_offload_errors: 0     <<<< offload, check these
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0
     rx_hwtstamp_cleared: 0
     uncorr_ecc_errors: 0
     corr_ecc_errors: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0

> It isn't the cable either I have tried different cables it
> still happens! This is an issue with the Kernel module for
> the e1000e NIC card.

Excellent data point: you have ruled out a faulty cable. But your
assumption that it is the kernel module that is broken is still
unfounded.

I am demonstrably running the same type of NIC (albeit a different
revision) with the same driver, and I do not observe any issues. Thus,
applying Occam's Razor, scrutinising your particular NIC is in order.

> This is a bug with the kernel that needs to be fixed in Debian!

That is not certain. Please refrain from making such statements. You have
a problem and you want it fixed, but you are deaf to what you might need
to consider to get the problem resolved. So you may end up living with
this problem for longer than you have to.

> I have already replaced it but this bug needs to be fixed
> by the Debian kernel team!

Replaced *what*? The NIC? What did you replace it with? The cable?

Troubleshooting tips:
 1) Remain calm. When agitated, mistakes happen.
 2) Document every step.
 3) Change one thing at a time.
 4) Record the outcome of the change.
 5) Go back to 3) until you can form a hypothesis.
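
Steps 2) to 4) can be as simple as a small shell helper. The following is
a sketch only: the function names, the log file name nic-debug.log and
the log format are invented for illustration. The string being counted
is the one from the subject line of this thread.

```shell
#!/bin/sh
# Sketch: keep a timestamped record of each change and its outcome.
LOG=${LOG:-nic-debug.log}

record_step() {
    # $1 = what was changed, $2 = observed outcome
    printf '%s\tchange=%s\toutcome=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOG"
}

count_unit_hangs() {
    # Reads kernel log text on stdin and prints the number of hang messages.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -c 'Detected Hardware Unit Hang' || true
}
```

For example, after raising the ring size, let the machine run for a
while, then:
`record_step "tx ring 256 -> 2048" "hangs: $(journalctl -k -b | count_unit_hangs)"`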

You have tried some things and drawn a conclusion, yet that conclusion may
be incorrect. Be open to the fact you may be wrong and that the answer
could be something other than what you think/want it to be.

-- 
Kind regards,

/S
