Bug#926202: mpt3sas driver stuck in endless resetting loop under load
Package: linux-image-4.19.0-0.bpo.2-amd64
Version: 4.19.16-1~bpo9+1
When testing 4.19 from backports, we ran into an issue that manifests
under heavy I/O load. The driver gets stuck in an endless reset loop,
which badly degrades I/O performance.
The affected server uses a Supermicro H11DSi-NT motherboard and an LSI
3008 HBA (Supermicro part number AOC-S3008L-L8I). Six software (Linux
MD) RAID1 arrays are formed from 12 drives (6 rotating, 6 SSDs). The
issue can be triggered reliably by forcing the arrays to resync (with
sync_speed_max set to 600000 for all of them) and running pvmove between
two of the SSD arrays at the same time. After a few seconds the server
becomes unresponsive, and both the array resync speed and the pvmove
progress drop to near zero. Accessing files that are not cached in
memory (or saving files) takes several seconds.
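
For anyone trying to reproduce this, the md side of the load can be
sketched roughly as follows. This is only a sketch under assumptions:
the array names are placeholders, the function name start_md_stress is
made up for illustration, and the report does not say how the resync
was forced, so writing "repair" to sync_action is just one possible way
to start a full pass. It must run as root on a machine where losing
I/O responsiveness is acceptable.

```python
from pathlib import Path


def start_md_stress(arrays, speed_max=600000, action="repair",
                    sysfs_root="/sys/block"):
    """Raise the resync speed cap and request a sync pass on each array.

    `arrays` is a list of md device names such as ["md0", "md1"]
    (placeholder names, not taken from the report).
    """
    for name in arrays:
        md_dir = Path(sysfs_root) / name / "md"
        # Per-array cap; 600000 is the value quoted in the report.
        (md_dir / "sync_speed_max").write_text(f"{speed_max}\n")
        # Assumption: "repair" is used here to force a full pass over the
        # array; the original test may have used a different method.
        (md_dir / "sync_action").write_text(f"{action}\n")
```

With the arrays syncing, a pvmove between two of the SSD-backed
physical volumes (for example "pvmove /dev/md4 /dev/md5", where both
device names are hypothetical) is started in a second shell.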
For testing purposes, we tried to move all the hard drives into another
server that has the same LSI hardware, but the issue persisted. Neither
of those servers is in production use yet, so we should be able to test
potential solutions and patches easily.
This is what dmesg logs (an ellipsis indicates the same message
repeating for other drives):
[ 242.680063] mpt3sas_cm0: fault_state(0x5862)!
[ 242.680114] mpt3sas_cm0: sending diag reset !!
[ 243.713254] mpt3sas_cm0: diag reset: SUCCESS
[ 243.742277] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default
host page size to 4k
[ 243.897265] mpt3sas_cm0: _base_display_fwpkg_version: complete
[ 243.897639] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.01.00),
ChipRevision(0x02), BiosVersion(08.37.00.00)
[ 243.897705] mpt3sas_cm0: Protocol=(
[ 243.897706] Initiator
[ 243.897732] ,Target
[ 243.897752] ),
[ 243.897770] Capabilities=(
[ 243.897785] TLR
[ 243.897806] ,EEDP
[ 243.897821] ,Snapshot Buffer
[ 243.897837] ,Diag Trace Buffer
[ 243.897859] ,Task Set Full
[ 243.897883] ,NCQ
[ 243.897904] )
[ 243.897988] mpt3sas_cm0: sending port enable !!
[ 251.003196] mpt3sas_cm0: port enable: SUCCESS
[ 251.003376] mpt3sas_cm0: search for end-devices: start
[ 251.003835] scsi target0:0:0: handle(0x000a),
sas_addr(0x50030480180580c0)
[ 251.003886] scsi target0:0:0: enclosure logical
id(0x50030480180580ff), slot(0)
[ 251.003980] scsi target0:0:1: handle(0x000b),
sas_addr(0x50030480180580c1)
[ 251.004029] scsi target0:0:1: enclosure logical
id(0x50030480180580ff), slot(1)
...
[ 251.021639] scsi target0:0:12: handle(0x0016),
sas_addr(0x50030480180580fd)
[ 251.022166] scsi target0:0:12: enclosure logical
id(0x50030480180580ff), slot(12)
[ 251.022740] mpt3sas_cm0: search for end-devices: complete
[ 251.023281] mpt3sas_cm0: search for end-devices: start
[ 251.023817] mpt3sas_cm0: search for PCIe end-devices: complete
[ 251.024365] mpt3sas_cm0: search for expanders: start
[ 251.024934] expander present: handle(0x0009),
sas_addr(0x50030480180580ff)
[ 251.025502] mpt3sas_cm0: search for expanders: complete
[ 251.026038] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[ 251.026086] mpt3sas_cm0: removing unresponding devices: start
[ 251.028140] mpt3sas_cm0: removing unresponding devices: end-devices
[ 251.029484] mpt3sas_cm0: Removing unresponding devices: pcie
end-devices
[ 251.030795] mpt3sas_cm0: removing unresponding devices: expanders
[ 251.032106] mpt3sas_cm0: removing unresponding devices: complete
[ 251.033412] mpt3sas_cm0: scan devices: start
[ 251.035183] mpt3sas_cm0: scan devices: expanders start
[ 251.038769] mpt3sas_cm0: break from expander scan:
ioc_status(0x0022), loginfo(0x310f0400)
[ 251.040140] mpt3sas_cm0: scan devices: expanders complete
[ 251.041496] mpt3sas_cm0: scan devices: end devices start
[ 251.043860] mpt3sas_cm0: break from end device scan:
ioc_status(0x0022), loginfo(0x310f0400)
[ 251.044991] mpt3sas_cm0: scan devices: end devices complete
[ 251.046109] mpt3sas_cm0: scan devices: pcie end devices start
[ 251.047237] mpt3sas_cm0: log_info(0x3003011d): originator(IOP),
code(0x03), sub_code(0x011d)
[ 251.048402] mpt3sas_cm0: log_info(0x3003011d): originator(IOP),
code(0x03), sub_code(0x011d)
[ 251.049505] mpt3sas_cm0: break from pcie end device scan:
ioc_status(0x0021), loginfo(0x3003011d)
[ 251.050319] mpt3sas_cm0: pcie devices: pcie end devices complete
[ 251.050966] mpt3sas_cm0: scan devices: complete
[ 251.503219] sd 0:0:0:0: Power-on or device reset occurred
[ 251.503261] sd 0:0:2:0: Power-on or device reset occurred
...
[ 252.007740] sd 0:0:8:0: Power-on or device reset occurred
One second later the kernel logs
[ 253.080085] mpt3sas_cm0: fault_state(0x5862)!
and the whole process repeats itself. It keeps repeating until the
server is shut down.
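
To put numbers on the loop, the log can be post-processed with a small
script. The sketch below measures the gap between consecutive
fault_state messages (about 10.4 seconds between the two shown above)
and splits a loginfo word into the fields the driver prints. The bit
layout is inferred from the decoded log_info(0x3003011d) line in the
log, and the originator names for values 1 and 2 are an assumption
based on my reading of the driver source, so treat both as unverified.

```python
import re

# Matches lines such as "[ 242.680063] mpt3sas_cm0: fault_state(0x5862)!"
FAULT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\S+: fault_state\((0x[0-9a-fA-F]+)\)")

# Originator names; only "IOP" (value 0) is confirmed by the log above.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def fault_intervals(dmesg_text):
    """Return the gaps in seconds between consecutive fault_state lines."""
    stamps = [float(m.group(1)) for m in FAULT_RE.finditer(dmesg_text)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def split_loginfo(value):
    """Split a 32-bit loginfo word into bus type, originator, code, sub code."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    sample = (
        "[ 242.680063] mpt3sas_cm0: fault_state(0x5862)!\n"
        "[ 253.080085] mpt3sas_cm0: fault_state(0x5862)!\n"
    )
    print(["%.3f s" % gap for gap in fault_intervals(sample)])
    # Both loginfo values below are copied from the log in this report.
    for word in (0x3003011D, 0x310F0400):
        print(hex(word), split_loginfo(word))
```

Feeding the full dmesg output to fault_intervals() should show whether
the reset period stays constant over a longer run.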
The lspci output for the HBA is:
02:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios
Logic SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
This issue appears to be similar to one I found in the Ubuntu bug
tracker
(https://bugs.launchpad.net/ubuntu/bionic/+source/linux/+bug/1810781);
however, the patch mentioned there seems to have been applied to the
backports kernel already.
The standard Stretch kernel (4.9) does not seem to be affected by this.
Thanks in advance for any suggestions or patches.