
Bug#898336: DL380 instability with hpwdt



I have a DL380 G7 which I got for free a few months ago and set up in my home lab. I installed Debian 12.5 and got lots of errors in the logs and random freezes a few times a day.
While investigating, I came across posts saying that the problem was hpwdt and that the fix was to blacklist it. Since I did so, this server has been an absolute beauty with no issues at all.
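
For anyone wanting to reproduce the workaround, here is a minimal sketch of how a module is typically blacklisted on Debian. The helper name blacklist_module, the CONF_DIR variable and the file name hpwdt-blacklist.conf are illustrative assumptions, not necessarily what was used on this machine.

```shell
#!/bin/sh
# Sketch only: write a modprobe blacklist entry for a kernel module.
# CONF_DIR and the "<module>-blacklist.conf" file name are assumptions
# chosen for illustration; any *.conf file under /etc/modprobe.d works.
CONF_DIR="${CONF_DIR:-/etc/modprobe.d}"

blacklist_module() {
    # $1 = module name; writes "blacklist <module>" into CONF_DIR
    printf 'blacklist %s\n' "$1" > "$CONF_DIR/$1-blacklist.conf"
}

# Usage, as root:
#   blacklist_module hpwdt
#   update-initramfs -u    # so the initramfs also honours the blacklist
#   reboot, then check:  lsmod | grep hpwdt   (no output expected)
```

The function only writes the file; regenerating the initramfs and rebooting are left as manual steps so nothing happens to the system by accident.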

Happy to run tests for you on weekends. Although I am not a noob at USING Debian (sysadmin here), I have no experience with kernel or module programming, so you may need to tell me exactly what to do to collect data for you.

Cheers
Marcos

inxi
CPU: 2x 6-core Intel Xeon X5680 (-MT MCP SMP-) speed/min/max: 2487/1596/3326 MHz
Kernel: 6.10.6+bpo-amd64 x86_64 Up: 5d 16h 55m Mem: 35.15/188.88 GiB (18.6%)
Storage: 34.83 TiB (3.5% used) Procs: 422 Shell: Bash inxi: 3.3.36

uname -a
Linux Earth2 6.10.6+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.10.6-1~bpo12+1 (2024-08-26) x86_64 GNU/Linux

lsmod
Module                  Size  Used by
cpuid                  12288  0
vhost_net              36864  5
vhost                  65536  1 vhost_net
vhost_iotlb            16384  1 vhost
tap                    32768  1 vhost_net
tun                    69632  13 vhost_net
bridge                389120  0
stp                    12288  1 bridge
llc                    16384  2 bridge,stp
rfkill                 40960  1
qrtr                   53248  2
cpufreq_powersave      16384  0
amdgpu              12939264  0
amdxcp                 12288  1 amdgpu
drm_exec               12288  1 amdgpu
binfmt_misc            28672  1
gpu_sched              65536  1 amdgpu
drm_buddy              20480  1 amdgpu
ipmi_ssif              45056  0
radeon               1888256  1
intel_powerclamp       16384  0
kvm_intel             413696  32
drm_suballoc_helper    12288  2 amdgpu,radeon
drm_display_helper    266240  2 amdgpu,radeon
kvm                  1343488  21 kvm_intel
cec                    69632  1 drm_display_helper
rc_core                73728  1 cec
drm_ttm_helper         12288  2 amdgpu,radeon
ttm                   102400  3 amdgpu,radeon,drm_ttm_helper
ghash_clmulni_intel    16384  0
drm_kms_helper        253952  3 drm_display_helper,amdgpu,radeon
sha512_ssse3           45056  0
sha256_ssse3           32768  0
sha1_ssse3             32768  0
i2c_algo_bit           12288  2 amdgpu,radeon
video                  77824  2 amdgpu,radeon
wmi                    28672  1 video
aesni_intel           364544  0
crypto_simd            16384  1 aesni_intel
cryptd                 28672  2 crypto_simd,ghash_clmulni_intel
sg                     45056  0
hpilo                  20480  0
joydev                 24576  0
intel_cstate           24576  0
serio_raw              16384  0
evdev                  28672  7
pcspkr                 12288  0
ipmi_si                86016  1
intel_uncore          258048  0
iTCO_wdt               12288  0
intel_pmc_bxt          16384  1 iTCO_wdt
i7core_edac            32768  0
iTCO_vendor_support    12288  1 iTCO_wdt
watchdog               49152  1 iTCO_wdt
acpi_power_meter       24576  0
acpi_cpufreq           32768  0
acpi_ipmi              20480  1 acpi_power_meter
ipmi_devintf           16384  0
ipmi_msghandler        86016  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
button                 24576  0
scsi_dh_alua           24576  1
dm_service_time        12288  0
dm_multipath           45056  1 dm_service_time
coretemp               16384  0
drm                   749568  12 gpu_sched,drm_kms_helper,drm_exec,drm_suballoc_helper,drm_display_helper,drm_buddy,amdgpu,radeon,drm_ttm_helper,ttm,amdxcp
msr                    12288  0
efi_pstore             12288  0
loop                   40960  0
configfs               69632  1
ip_tables              28672  0
x_tables               53248  1 ip_tables
autofs4                57344  2
ext4                 1130496  7
crc16                  12288  1 ext4
mbcache                16384  1 ext4
jbd2                  196608  1 ext4
efivarfs               28672  0
raid10                 73728  0
raid456               196608  0
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
xor                    20480  1 async_xor
raid6_pq              122880  3 async_pq,raid456,async_raid6_recov
libcrc32c              12288  1 raid456
crc32c_generic         12288  0
raid1                  61440  0
raid0                  24576  0
md_mod                225280  4 raid1,raid10,raid0,raid456
dm_mod                208896  25 dm_multipath
hid_generic            12288  0
usbhid                 77824  0
hid                   253952  2 usbhid,hid_generic
qla2xxx              1171456  2
sd_mod                 81920  8
nvme_fc                53248  1 qla2xxx
nvme_fabrics           32768  1 nvme_fc
nvme_core             192512  2 nvme_fc,nvme_fabrics
t10_pi                 20480  2 sd_mod,nvme_core
uhci_hcd               61440  0
crc64_rocksoft         16384  1 t10_pi
ehci_pci               16384  0
crc64                  16384  1 crc64_rocksoft
hpsa                  122880  6
ehci_hcd              110592  1 ehci_pci
crc_t10dif             16384  1 t10_pi
crct10dif_generic      12288  0
scsi_transport_fc     102400  1 qla2xxx
scsi_transport_sas     57344  1 hpsa
usbcore               401408  4 ehci_pci,usbhid,ehci_hcd,uhci_hcd
psmouse               208896  0
scsi_mod              319488  8 scsi_transport_sas,sd_mod,dm_multipath,qla2xxx,scsi_dh_alua,scsi_transport_fc,hpsa,sg
crct10dif_pclmul       12288  1
crc32_pclmul           12288  0
crc32c_intel           16384  14
bnx2                  118784  0
lpc_ich                28672  0
usb_common             16384  3 usbcore,ehci_hcd,uhci_hcd
crct10dif_common       12288  3 crct10dif_generic,crc_t10dif,crct10dif_pclmul
scsi_common            16384  5 scsi_mod,sd_mod,qla2xxx,hpsa,sg


lsblk  
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                      8:0    0 273.4G  0 disk  
├─sda1                   8:1    0   487M  0 part /boot
├─sda2                   8:2    0     1K  0 part  
└─sda5                   8:5    0 272.9G  0 part  
 ├─Earth2--vg-root    254:0    0  43.3G  0 lvm  /
 ├─Earth2--vg-var     254:1    0   9.3G  0 lvm  /var
 ├─Earth2--vg-swap_1  254:2    0   976M  0 lvm  [SWAP]
 ├─Earth2--vg-tmp     254:3    0   1.9G  0 lvm  /tmp
 └─Earth2--vg-home    254:4    0 169.5G  0 lvm  /home
sdb                      8:16   0  17.3T  0 disk  
└─sdb1                   8:17   0  17.3T  0 part  
sdc                      8:32   0  17.3T  0 disk  
└─sdc1                   8:33   0  17.3T  0 part  
 ├─Oort-VMDisks       254:5    0     7T  0 lvm  /Oort/VMDisks
 └─Oort-NextcloudDisk 254:6    0     5T  0 lvm  /Oort/NextcloudDisk

On Thu, 10 Oct 2024 at 12:44, Jerry Hoemann <jerry.hoemann@hpe.com> wrote:
On Wed, Oct 09, 2024 at 09:00:00PM +0200, Ben Hutchings wrote:
> Hi Jerry,
>
> The Debian kernel team received a number of reports over the past few
> years of instability of the Proliant DL380 G7 and DL380p G8, seemingly
> related to the hpwdt driver (in that this goes away if it is not
> loaded).  These reports can be seen at
> <https://bugs.debian.org/898336>.
>
> The instability has been seen with kernel versions ranging from 4.16 to
> 6.1.y, including after the backport of commit dced0b3e51dd
> ("watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").
>
> I can see that hpwdt seems to be used for error reporting so it's not
> clear to me whether these are problems caused by the driver, or the
> driver is only reporting that something bad happened.
>
> Do you have any ideas about what's going wrong here?  Is there
> something odd about these models that needs to be handled in hpwdt, or
> are they just popular models?

Hi Ben,

There are a couple things that come to mind.

As you mentioned, hpwdt is used for error containment on ProLiants.
Especially on the older generations, errors would be raised as
NMI and the expectation was that hpwdt would handle the NMI and
initiate a kdump.  I have seen cases where shutting down file
systems can raise PCIe errors which would be transmitted to the
SUT as NMI and handled by hpwdt.

The second issue is that systemd enables WDT (not just hpwdt) during
shutdown.  This is to handle the case where shutdown hangs.  The WDT
is supposed to break the system out of such situations.  The default
timeout is 10 minutes:
        /etc/systemd/system.conf:
        #RebootWatchdogSec=10min
(Note: I'm not a Debian user, but I believe the systemd behavior is the
same on Debian as it is on RHEL/SLES.)

While a ten minute delay to shutdown would be fairly obvious if you're
doing interactive testing, it might not be noticed if the testing is
automated.
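
A small sketch for checking whether RebootWatchdogSec has been set explicitly in a systemd manager configuration file. The helper name reboot_watchdog_setting is hypothetical, and the sketch ignores drop-in files (for example under /etc/systemd/system.conf.d/), which can also override the value.

```shell
#!/bin/sh
# Sketch only: report an explicit RebootWatchdogSec= assignment, if any.
# The function name is an illustrative assumption; drop-in directories
# are deliberately not consulted here.
reboot_watchdog_setting() {
    # $1 = path to the config file (default: /etc/systemd/system.conf)
    conf="${1:-/etc/systemd/system.conf}"
    # Keep the last uncommented assignment; commented lines are skipped.
    value="$(sed -n 's/^[[:space:]]*RebootWatchdogSec=//p' "$conf" 2>/dev/null | tail -n 1)"
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        # Nothing set explicitly, so the built-in default applies.
        printf 'unset\n'
    fi
}

# Usage:
#   reboot_watchdog_setting
```

If this prints "unset", only the commented-out default is present in that file.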

To determine if either of the above is happening, you can:

o) do the testing interactively and time the test.  Does the NMI come in
roughly 10 minutes after the shutdown?

o) Check the IEL and IML on the iLO web interface.  Do you see any
errors reported during the shutdown?
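
To help with the timing check, one possible approach is to pull NMI-related lines out of the kernel log of the previous boot and compare their timestamps with the time the shutdown started. The helper name nmi_lines is hypothetical, and the usage line assumes a persistent journal so that the previous boot's log is still available.

```shell
#!/bin/sh
# Sketch only: filter kernel log text for lines mentioning NMI or hpwdt.
# The function name is an illustrative assumption.
nmi_lines() {
    # Reads log text on stdin; "|| true" keeps the exit status at zero
    # when no line matches.
    grep -iE 'nmi|hpwdt' || true
}

# Usage (requires a persistent journal):
#   journalctl -k -b -1 -o short-precise | nmi_lines
```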


Questions:
1) The Debian bug above mentions only Gen 7 and 8 systems.
   Are you seeing this issue on other ProLiant systems?

2) You mentioned back-porting commit dced0b3e51dd.  Does your
   drivers/watchdog/hpwdt.c source match upstream Linux? Or
   do you cherry-pick patches?  (Sorry, not knowing Debian,
   I don't know how to find/navigate your kernel source.)

Please let me know what you find.


Jerry


--

-----------------------------------------------------------------------------
Jerry Hoemann                  Software Engineer   Hewlett Packard Enterprise
-----------------------------------------------------------------------------


--
Marcos R Carot
