Bug#898336: DL380 instability with hpwdt
I have a DL380 G7 which I got for free a few months ago and set up in my home lab. I installed Debian 12.5 and got lots of errors in the logs and random freezes a few times a day.
While investigating, I came across posts saying that the problem was hpwdt and that it should be blacklisted. Since I did that, this server has been an absolute beauty with no issues at all.
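For anyone wanting to reproduce the workaround: blacklisting is normally done with a modprobe.d fragment. The sketch below assumes that mechanism and is not a record of the exact steps taken on this machine. The file name hpwdt-blacklist.conf and both function names are made up for illustration, and the directory is a parameter so the helper can be tried on a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d fragment that stops hpwdt
# from being auto-loaded. The target directory is an argument so the
# function can be tried on a scratch directory before /etc/modprobe.d.
write_hpwdt_blacklist() {
    dir="$1"
    printf 'blacklist hpwdt\n' > "$dir/hpwdt-blacklist.conf"
}

# Report whether a module appears in a /proc/modules-style listing.
# The second argument defaults to the live /proc/modules.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Typical use, as root:
#   write_hpwdt_blacklist /etc/modprobe.d
#   update-initramfs -u    # in case the module is also in the initramfs
#   reboot
#   module_loaded hpwdt && echo "hpwdt still loaded" || echo "hpwdt not loaded"
```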
Happy to run tests for you on weekends. Although I am not a noob at USING Debian (sys admin here), I have no experience with kernel or module programming, so you may need to tell me exactly what to do to collect data for you.
Cheers
Marcos
inxi
CPU: 2x 6-core Intel Xeon X5680 (-MT MCP SMP-) speed/min/max: 2487/1596/3326 MHz
Kernel: 6.10.6+bpo-amd64 x86_64 Up: 5d 16h 55m Mem: 35.15/188.88 GiB (18.6%)
Storage: 34.83 TiB (3.5% used) Procs: 422 Shell: Bash inxi: 3.3.36
uname -a
Linux Earth2 6.10.6+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.10.6-1~bpo12+1 (2024-08-26) x86_64 GNU/Linux
lsmod
Module Size Used by
cpuid 12288 0
vhost_net 36864 5
vhost 65536 1 vhost_net
vhost_iotlb 16384 1 vhost
tap 32768 1 vhost_net
tun 69632 13 vhost_net
bridge 389120 0
stp 12288 1 bridge
llc 16384 2 bridge,stp
rfkill 40960 1
qrtr 53248 2
cpufreq_powersave 16384 0
amdgpu 12939264 0
amdxcp 12288 1 amdgpu
drm_exec 12288 1 amdgpu
binfmt_misc 28672 1
gpu_sched 65536 1 amdgpu
drm_buddy 20480 1 amdgpu
ipmi_ssif 45056 0
radeon 1888256 1
intel_powerclamp 16384 0
kvm_intel 413696 32
drm_suballoc_helper 12288 2 amdgpu,radeon
drm_display_helper 266240 2 amdgpu,radeon
kvm 1343488 21 kvm_intel
cec 69632 1 drm_display_helper
rc_core 73728 1 cec
drm_ttm_helper 12288 2 amdgpu,radeon
ttm 102400 3 amdgpu,radeon,drm_ttm_helper
ghash_clmulni_intel 16384 0
drm_kms_helper 253952 3 drm_display_helper,amdgpu,radeon
sha512_ssse3 45056 0
sha256_ssse3 32768 0
sha1_ssse3 32768 0
i2c_algo_bit 12288 2 amdgpu,radeon
video 77824 2 amdgpu,radeon
wmi 28672 1 video
aesni_intel 364544 0
crypto_simd 16384 1 aesni_intel
cryptd 28672 2 crypto_simd,ghash_clmulni_intel
sg 45056 0
hpilo 20480 0
joydev 24576 0
intel_cstate 24576 0
serio_raw 16384 0
evdev 28672 7
pcspkr 12288 0
ipmi_si 86016 1
intel_uncore 258048 0
iTCO_wdt 12288 0
intel_pmc_bxt 16384 1 iTCO_wdt
i7core_edac 32768 0
iTCO_vendor_support 12288 1 iTCO_wdt
watchdog 49152 1 iTCO_wdt
acpi_power_meter 24576 0
acpi_cpufreq 32768 0
acpi_ipmi 20480 1 acpi_power_meter
ipmi_devintf 16384 0
ipmi_msghandler 86016 4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
button 24576 0
scsi_dh_alua 24576 1
dm_service_time 12288 0
dm_multipath 45056 1 dm_service_time
coretemp 16384 0
drm 749568 12 gpu_sched,drm_kms_helper,drm_exec,drm_suballoc_helper,drm_display_helper,drm_buddy,amdgpu,radeon,drm_ttm_helper,ttm,amdxcp
msr 12288 0
efi_pstore 12288 0
loop 40960 0
configfs 69632 1
ip_tables 28672 0
x_tables 53248 1 ip_tables
autofs4 57344 2
ext4 1130496 7
crc16 12288 1 ext4
mbcache 16384 1 ext4
jbd2 196608 1 ext4
efivarfs 28672 0
raid10 73728 0
raid456 196608 0
async_raid6_recov 20480 1 raid456
async_memcpy 16384 2 raid456,async_raid6_recov
async_pq 16384 2 raid456,async_raid6_recov
async_xor 16384 3 async_pq,raid456,async_raid6_recov
async_tx 16384 5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
xor 20480 1 async_xor
raid6_pq 122880 3 async_pq,raid456,async_raid6_recov
libcrc32c 12288 1 raid456
crc32c_generic 12288 0
raid1 61440 0
raid0 24576 0
md_mod 225280 4 raid1,raid10,raid0,raid456
dm_mod 208896 25 dm_multipath
hid_generic 12288 0
usbhid 77824 0
hid 253952 2 usbhid,hid_generic
qla2xxx 1171456 2
sd_mod 81920 8
nvme_fc 53248 1 qla2xxx
nvme_fabrics 32768 1 nvme_fc
nvme_core 192512 2 nvme_fc,nvme_fabrics
t10_pi 20480 2 sd_mod,nvme_core
uhci_hcd 61440 0
crc64_rocksoft 16384 1 t10_pi
ehci_pci 16384 0
crc64 16384 1 crc64_rocksoft
hpsa 122880 6
ehci_hcd 110592 1 ehci_pci
crc_t10dif 16384 1 t10_pi
crct10dif_generic 12288 0
scsi_transport_fc 102400 1 qla2xxx
scsi_transport_sas 57344 1 hpsa
usbcore 401408 4 ehci_pci,usbhid,ehci_hcd,uhci_hcd
psmouse 208896 0
scsi_mod 319488 8 scsi_transport_sas,sd_mod,dm_multipath,qla2xxx,scsi_dh_alua,scsi_transport_fc,hpsa,sg
crct10dif_pclmul 12288 1
crc32_pclmul 12288 0
crc32c_intel 16384 14
bnx2 118784 0
lpc_ich 28672 0
usb_common 16384 3 usbcore,ehci_hcd,uhci_hcd
crct10dif_common 12288 3 crct10dif_generic,crc_t10dif,crct10dif_pclmul
scsi_common 16384 5 scsi_mod,sd_mod,qla2xxx,hpsa,sg
lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                        8:0    0 273.4G  0 disk
├─sda1                     8:1    0   487M  0 part /boot
├─sda2                     8:2    0     1K  0 part
└─sda5                     8:5    0 272.9G  0 part
  ├─Earth2--vg-root      254:0    0  43.3G  0 lvm  /
  ├─Earth2--vg-var       254:1    0   9.3G  0 lvm  /var
  ├─Earth2--vg-swap_1    254:2    0   976M  0 lvm  [SWAP]
  ├─Earth2--vg-tmp       254:3    0   1.9G  0 lvm  /tmp
  └─Earth2--vg-home      254:4    0 169.5G  0 lvm  /home
sdb                        8:16   0  17.3T  0 disk
└─sdb1                     8:17   0  17.3T  0 part
sdc                        8:32   0  17.3T  0 disk
└─sdc1                     8:33   0  17.3T  0 part
  ├─Oort-VMDisks         254:5    0     7T  0 lvm  /Oort/VMDisks
  └─Oort-NextcloudDisk   254:6    0     5T  0 lvm  /Oort/NextcloudDisk
On Wed, Oct 09, 2024 at 09:00:00PM +0200, Ben Hutchings wrote:
> Hi Jerry,
>
> The Debian kernel team received a number of reports over the past few
> years of instability of the Proliant DL380 G7 and DL380p G8, seemingly
> related to the hpwdt driver (in that this goes away if it is not
> loaded). These reports can be seen at
> <https://bugs.debian.org/898336>.
>
> The instability has been seen with kernel versions ranging from 4.16 to
> 6.1.y, including after the backport of commit dced0b3e51dd
> ("watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").
>
> I can see that hpwdt seems to be used for error reporting so it's not
> clear to me whether these are problems caused by the driver, or the
> driver is only reporting that something bad happened.
>
> Do you have any ideas about what's going wrong here? Is there
> something odd about these models that needs to be handled in hpwdt, or
> are they just popular models?
Hi Ben,
There are a couple things that come to mind.
As you mentioned, hpwdt is used for error containment on ProLiants.
Especially on the older generations, errors would be raised as
NMIs, and the expectation was that hpwdt would handle the NMI and
initiate a kdump. I have seen cases where shutting down file
systems can raise PCIe errors, which would be transmitted to the
SUT as an NMI and handled by hpwdt.
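Since the expectation is that an NMI handled by hpwdt ends in a kdump, it may be worth checking whether a crash kernel is actually loaded on the affected machines; without one, a panic cannot produce a dump. A rough sketch, assuming the standard files /sys/kernel/kexec_crash_loaded and /proc/sys/kernel/panic are present. The function names are hypothetical, and the paths are parameters so the helpers can also be run against copies.

```shell
#!/bin/sh
# Print "loaded" if a kdump (crash) kernel is loaded, "not loaded" if
# it is not, and "unknown" if the file is missing or unreadable.
crash_kernel_state() {
    f="${1:-/sys/kernel/kexec_crash_loaded}"
    if [ ! -r "$f" ]; then
        echo unknown
        return
    fi
    case "$(cat "$f")" in
        1) echo loaded ;;
        0) echo "not loaded" ;;
        *) echo unknown ;;
    esac
}

# Print the panic timeout in seconds; 0 means the kernel does not
# reboot on its own after a panic.
panic_timeout() {
    cat "${1:-/proc/sys/kernel/panic}"
}

# Typical use on the affected host:
#   crash_kernel_state
#   panic_timeout
```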
The second issue is that systemd enables WDT (not just hpwdt) during
shutdown. This is to handle the case where shutdown hangs. The WDT
is supposed to break the system out of such situations. The default
timeout is 10 minutes:
/etc/systemd/system.conf:
#RebootWatchdogSec=10min
(Note: I'm not a Debian user, but I believe the systemd behavior is the
same on Debian as it is on RHEL/SLES.)
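To see whether the reboot watchdog timeout has been changed from the default on a given machine, something like the following could be used. It is only a sketch: it reads a single file, so any drop-ins under /etc/systemd/system.conf.d/ are not considered, and the function name is made up.

```shell
#!/bin/sh
# Print the explicitly configured RebootWatchdogSec value from a
# systemd system.conf-style file, or "default" if the setting is
# absent or only present as a comment.
reboot_watchdog_setting() {
    conf="${1:-/etc/systemd/system.conf}"
    # Keep only uncommented assignments; the last one in the file wins.
    val=$(sed -n 's/^[[:space:]]*RebootWatchdogSec=//p' "$conf" 2>/dev/null | tail -n 1)
    if [ -n "$val" ]; then
        echo "$val"
    else
        echo "default"
    fi
}

# Typical use:
#   reboot_watchdog_setting
```

An output of "default" means the compiled-in value applies, which is the 10 minutes shown in the commented line above.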
While a ten minute delay to shutdown would be fairly obvious if you're
doing interactive testing, it might not be noticed if the testing is
automated.
To determine if either of the above is happening, you can:
o) do the testing interactively and time the test. Does the NMI come in
roughly 10 minutes after the shutdown?
o) Check the IEL and IML on the iLO web interface. Do you see any
errors reported during the shutdown?
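The timing check in the first item can be reduced to simple arithmetic once both timestamps are known. The helper below is a hypothetical sketch: the 600 second timeout matches the default mentioned above, while the 30 second tolerance is an arbitrary assumption, and the timestamp in the usage comment is an invented example.

```shell
#!/bin/sh
# Succeed if the gap between the start of shutdown and the NMI is
# within `tolerance` seconds of the watchdog timeout.
# Arguments: start_epoch nmi_epoch [timeout_sec] [tolerance_sec]
near_watchdog_timeout() {
    start="$1"
    nmi="$2"
    timeout="${3:-600}"
    tolerance="${4:-30}"
    delta=$((nmi - start - timeout))
    if [ "$delta" -lt 0 ]; then
        delta=$((0 - delta))
    fi
    [ "$delta" -le "$tolerance" ]
}

# Typical use:
#   start=$(date +%s)    # note this just before issuing the shutdown
#   nmi=$(date -d '2024-10-12 14:10:05' +%s)    # NMI time from the IML (GNU date; example value)
#   near_watchdog_timeout "$start" "$nmi" && echo "consistent with the watchdog timeout"
```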
Questions:
1) The Debian bug above mentions only Gen 7 and 8 systems.
Are you seeing this issue on other ProLiant systems?
2) You mentioned back-porting commit dced0b3e51dd. Does your
drivers/watchdog/hpwdt.c source match upstream Linux? Or
do you cherry-pick patches? (Sorry, not knowing Debian,
I don't know how to find/navigate your kernel source.)
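One way the second question could be checked on the Debian side, offered as a sketch rather than a known-good procedure: fetch the packaged kernel source and compare its hpwdt.c against an upstream checkout of the matching stable version. The function name and both paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Compare a distribution copy of hpwdt.c with an upstream copy.
# Prints "identical" or "differs"; a missing file also counts as
# "differs". Both paths are supplied by the caller.
compare_hpwdt_source() {
    if cmp -s "$1" "$2"; then
        echo identical
    else
        echo differs
    fi
}

# Typical use (paths are placeholders):
#   apt-get source linux    # needs deb-src entries in the APT sources
#   compare_hpwdt_source linux-*/drivers/watchdog/hpwdt.c \
#       /path/to/upstream/drivers/watchdog/hpwdt.c
#   diff -u /path/to/upstream/drivers/watchdog/hpwdt.c \
#       linux-*/drivers/watchdog/hpwdt.c    # to see the actual changes
```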
Please let me know what you find.
Jerry
--
-----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett Packard Enterprise
-----------------------------------------------------------------------------