
Bug#926202: mpt3sas driver stuck in endless resetting loop under load



Package: linux-image-4.19.0-0.bpo.2-amd64
Version: 4.19.16-1~bpo9+1

When testing 4.19 from backports, we ran into an issue that manifests under heavy I/O load: the driver gets stuck in an endless reset loop, which severely degrades I/O performance.

The affected server uses a Supermicro H11DSi-NT motherboard and an LSI 3008 HBA (Supermicro part number AOC-S3008L-L8I). Six software (Linux MD) RAID1 arrays are formed from 12 drives (6 rotating, 6 SSDs). The issue can be triggered reliably by forcing the arrays to resync (with sync_speed_max set to 600000 for all of them) while running pvmove between two of the SSD arrays. After a few seconds the server becomes unresponsive, and both the array resync speed and the pvmove progress drop to near zero. Accessing files that are not cached in memory (or saving files) takes several seconds.
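For anyone trying to reproduce this, the load pattern described above can be sketched roughly as follows. This is only an illustration under assumptions: the array names (md0 to md5) and the pvmove source and destination are hypothetical, and the report does not state how the resync was forced, so writing "repair" to sync_action is just one possible way. The script prints the steps by default and executes them only when asked to.

```python
#!/usr/bin/env python3
"""Sketch: MD resync on all arrays plus a concurrent pvmove (dry run by default)."""
import subprocess
from pathlib import Path

# Hypothetical names: the report does not list the actual array devices.
ARRAYS = ["md0", "md1", "md2", "md3", "md4", "md5"]
PVMOVE_SRC = "/dev/md4"   # assumed SSD-backed array (source PV)
PVMOVE_DST = "/dev/md5"   # assumed SSD-backed array (destination PV)
SYNC_SPEED_MAX = 600000   # value taken from the report


def sysfs_writes(arrays, speed=SYNC_SPEED_MAX, sysfs_root="/sys/block"):
    """Return (path, value) pairs: raise the speed cap, then start a repair pass."""
    writes = []
    for name in arrays:
        md_dir = Path(sysfs_root) / name / "md"
        writes.append((md_dir / "sync_speed_max", str(speed)))
        # Assumption: "repair" is used here to force a full pass over the array.
        writes.append((md_dir / "sync_action", "repair"))
    return writes


def run(dry_run=True):
    """Print the steps, or (as root, with dry_run=False) actually perform them."""
    for path, value in sysfs_writes(ARRAYS):
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            path.write_text(value + "\n")
    cmd = ["pvmove", PVMOVE_SRC, PVMOVE_DST]
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(dry_run=True)
```

Keeping the sysfs writes in a separate function makes it easy to review exactly what would be written before running anything against real arrays.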

For testing purposes, we moved all the drives into another server that has the same LSI hardware, but the issue persisted. Neither of these servers is in production use yet, so we should be able to test potential solutions and patches easily.

This is logged in dmesg (an ellipsis indicates the same message repeating for other drives):

[  242.680063] mpt3sas_cm0: fault_state(0x5862)!
[  242.680114] mpt3sas_cm0: sending diag reset !!
[  243.713254] mpt3sas_cm0: diag reset: SUCCESS
[  243.742277] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[  243.897265] mpt3sas_cm0: _base_display_fwpkg_version: complete
[  243.897639] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.01.00), ChipRevision(0x02), BiosVersion(08.37.00.00)
[  243.897705] mpt3sas_cm0: Protocol=(
[  243.897706] Initiator
[  243.897732] ,Target
[  243.897752] ),
[  243.897770] Capabilities=(
[  243.897785] TLR
[  243.897806] ,EEDP
[  243.897821] ,Snapshot Buffer
[  243.897837] ,Diag Trace Buffer
[  243.897859] ,Task Set Full
[  243.897883] ,NCQ
[  243.897904] )
[  243.897988] mpt3sas_cm0: sending port enable !!
[  251.003196] mpt3sas_cm0: port enable: SUCCESS
[  251.003376] mpt3sas_cm0: search for end-devices: start
[  251.003835] scsi target0:0:0: handle(0x000a), sas_addr(0x50030480180580c0)
[  251.003886] scsi target0:0:0: enclosure logical id(0x50030480180580ff), slot(0)
[  251.003980] scsi target0:0:1: handle(0x000b), sas_addr(0x50030480180580c1)
[  251.004029] scsi target0:0:1: enclosure logical id(0x50030480180580ff), slot(1)
...
[  251.021639] scsi target0:0:12: handle(0x0016), sas_addr(0x50030480180580fd)
[  251.022166] scsi target0:0:12: enclosure logical id(0x50030480180580ff), slot(12)
[  251.022740] mpt3sas_cm0: search for end-devices: complete
[  251.023281] mpt3sas_cm0: search for end-devices: start
[  251.023817] mpt3sas_cm0: search for PCIe end-devices: complete
[  251.024365] mpt3sas_cm0: search for expanders: start
[  251.024934] expander present: handle(0x0009), sas_addr(0x50030480180580ff)
[  251.025502] mpt3sas_cm0: search for expanders: complete
[  251.026038] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[  251.026086] mpt3sas_cm0: removing unresponding devices: start
[  251.028140] mpt3sas_cm0: removing unresponding devices: end-devices
[  251.029484] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[  251.030795] mpt3sas_cm0: removing unresponding devices: expanders
[  251.032106] mpt3sas_cm0: removing unresponding devices: complete
[  251.033412] mpt3sas_cm0: scan devices: start
[  251.035183] mpt3sas_cm0:     scan devices: expanders start
[  251.038769] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[  251.040140] mpt3sas_cm0:     scan devices: expanders complete
[  251.041496] mpt3sas_cm0:     scan devices: end devices start
[  251.043860] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[  251.044991] mpt3sas_cm0:     scan devices: end devices complete
[  251.046109] mpt3sas_cm0:     scan devices: pcie end devices start
[  251.047237] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[  251.048402] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[  251.049505] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[  251.050319] mpt3sas_cm0:     pcie devices: pcie end devices complete
[  251.050966] mpt3sas_cm0: scan devices: complete
[  251.503219] sd 0:0:0:0: Power-on or device reset occurred
[  251.503261] sd 0:0:2:0: Power-on or device reset occurred
...
[  252.007740] sd 0:0:8:0: Power-on or device reset occurred
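The driver prints a decoded form only for log_info(0x3003011d); the loginfo(0x310f0400) value in the "break from ... scan" lines is left raw. A minimal sketch for splitting such values into fields is below. The bit layout (bus type in the top 4 bits, then originator, code, and a 16-bit sub-code) and the originator names other than IOP are assumptions based on my reading of the mpt3sas driver; the layout at least reproduces the decoded line in the log above.

```python
# Assumed originator names; only "IOP" is confirmed by the log in this report.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_loginfo(value):
    """Split a 32-bit loginfo word into its (assumed) bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def format_loginfo(value):
    """Render a loginfo word in the same style the driver uses in dmesg."""
    f = decode_loginfo(value)
    name = ORIGINATORS.get(f["originator"], "unknown")
    return (f"log_info(0x{value:08x}): originator({name}), "
            f"code(0x{f['code']:02x}), sub_code(0x{f['sub_code']:04x})")


if __name__ == "__main__":
    # The two loginfo values that appear in the log above.
    for raw in (0x3003011D, 0x310F0400):
        print(format_loginfo(raw))
```

This only separates the fields; what the individual code and sub-code values mean is not covered here.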

One second later the kernel logs:

[  253.080085] mpt3sas_cm0: fault_state(0x5862)!

and the whole process repeats itself. It keeps repeating until the server is shut down.
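To quantify the loop, the fault_state lines can be pulled out of a saved dmesg capture and the time between consecutive faults computed. The sketch below is a hypothetical helper, not part of any existing tool; the sample lines are copied from the log above.

```python
import re

# Matches lines such as "[  242.680063] mpt3sas_cm0: fault_state(0x5862)!"
FAULT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(mpt3sas_cm\d+): fault_state\((0x[0-9a-fA-F]+)\)"
)


def fault_events(lines):
    """Return (timestamp, controller, fault_code) for every fault_state line."""
    events = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3)))
    return events


def reset_intervals(events):
    """Seconds elapsed between consecutive fault_state events."""
    times = [t for t, _, _ in events]
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


SAMPLE = [
    "[  242.680063] mpt3sas_cm0: fault_state(0x5862)!",
    "[  243.713254] mpt3sas_cm0: diag reset: SUCCESS",
    "[  253.080085] mpt3sas_cm0: fault_state(0x5862)!",
]

if __name__ == "__main__":
    events = fault_events(SAMPLE)
    print(events)
    print(reset_intervals(events))
```

For the two faults shown in this report the interval is about 10.4 seconds, most of it spent between "sending port enable" and "port enable: SUCCESS". Since fault_events() accepts any iterable of lines, it can also be fed sys.stdin or an open log file.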

lspci info for the HBA is:
02:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)

This issue appears to be similar to one I found in the Ubuntu bug tracker (https://bugs.launchpad.net/ubuntu/bionic/+source/linux/+bug/1810781); however, the patch mentioned there already seems to be applied to the backports kernel.

The standard Stretch kernel (4.9) does not seem to be affected by this.

Thanks in advance for any suggestions or patches.

