
Bug#598793: sas controller resets causes drives to fail under mdadm



Quintin <quintin@quintin.co.nz> writes:

> Package: linux-image-2.6.26-2-xen-amd64
> Version: 2.6.26-24lenny1
> Severity: important
>
> The mpt SAS controller seems to misinterpret messages from the SATA
> drives connected to the SAS controller - causing mdadm to remove them
> from the array. The net result during this event is the array goes
> read-only or worse corrupts data.
>
> This issue has occurred on this version of equipment when running
> S.M.A.R.T. - which may be related?
[..]
> 05:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)


Sure is.  I have seen similar issues with this controller, especially
when running smartctl.

I believe this was fixed in the latest 2.6.32 stable update (2.6.32.23):

commit aaf3b48b50681f779723ea9bb141931739b75c4b
Author: Ryan Kuester <rkuester@kspace.net>
Date:   Mon Apr 26 18:11:54 2010 -0500

    SCSI: mptsas: fix hangs caused by ATA pass-through
    
    commit 2a1b7e575b80ceb19ea50bfa86ce0053ea57181d upstream.
    
    I may have an explanation for the LSI 1068 HBA hangs provoked by ATA
    pass-through commands, in particular by smartctl.
    
    First, my version of the symptoms.  On an LSI SAS1068E B3 HBA running
    01.29.00.00 firmware, with SATA disks, and with smartd running, I'm seeing
    occasional task, bus, and host resets, some of which lead to hard faults of
    the HBA requiring a reboot.  Abusively looping the smartctl command,
    
        # while true; do smartctl -a /dev/sdb > /dev/null; done
    
    dramatically increases the frequency of these failures to nearly one per
    minute.  A high IO load through the HBA while looping smartctl seems to
    increase the chance of a full scsi host reset or a non-recoverable hang.
    
    I reduced what smartctl was doing down to a simple test case which
    causes the hang with a single IO when pointed at the sd interface.  See
    the code at the bottom of this e-mail.  It uses an SG_IO ioctl to issue
    a single pass-through ATA identify device command.  If the buffer
    userspace gives for the read data has certain alignments, the task is
    issued to the HBA but the HBA fails to respond.  If run against the sg
    interface, neither the test code nor smartctl causes a hang.
    
    sd and sg handle the SG_IO ioctl slightly differently.  Unless you
    specifically set a flag to do direct IO, sg passes a buffer of its own,
    which is page-aligned, to the block layer and later copies the result
    into the userspace buffer regardless of its alignment.  sd, on the other
    hand, always does direct IO unless the userspace buffer fails an
    alignment test at block/blk-map.c line 57, in which case a page-aligned
    buffer is created and used for the transfer.
    
    The alignment test currently checks for word-alignment, the default
    setup by scsi_lib.c; therefore, userspace buffers of almost any
    alignment are given directly to the HBA as DMA targets.  The LSI 1068
    hardware doesn't seem to like at least a couple of the alignments which
    cross a page boundary (see the test code below).  Curiously, many
    page-boundary-crossing alignments do work just fine.
    
    So, either the hardware has a bug handling certain alignments or the
    hardware has a stricter alignment requirement than the driver is
    advertising.  If stricter alignment is required, then in no case should
    misaligned buffers from userspace be allowed through without being
    bounced or at least causing an error to be returned.
    
    It seems the mptsas driver could use blk_queue_dma_alignment() to advertise
    a stricter alignment requirement.  If it does, sd does the right thing and
    bounces misaligned buffers (see block/blk-map.c line 57).  The following
    patch to 2.6.34-rc5 makes my symptoms go away.  I'm sure this is the wrong
    place for this code, but it gets my idea across.
    
    Acked-by: Kashyap Desai <Kashyap.Desai@lsi.com>
    Signed-off-by: James Bottomley <James.Bottomley@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>



which was included in Debian linux-image-2.6.32-5-amd64 version
2.6.32-24.  You should check whether the issue is still present with
that version.

Given the age and stability of this driver, the issue was most likely
present in all 2.6.26 versions, and I doubt the fix will be backported
unless you do it yourself.



Bjørn


