Etch: Stock kernel issues with ATI SB600 SATA controller and 2xSamsung HD403LJ ?
I just ran into some trouble with the I/O subsystem of a new server running
Debian Etch (2.6.18-6-vserver-amd64 (Debian 2.6.18.dfsg.1-18etch1)) and
am wondering whether it might be related to SB600 patches that are
not yet part of the Debian kernel.
The problem occurs with both disks and persists even after the _entire_
hardware of the server was replaced. It therefore seems unlikely to be
a hardware defect of one of the individual components.
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
( full lspci -vv: http://pastesite.com/326/123 )
two SAMSUNG HD403LJ (FW: CT100-12) in a software raid1
( full smartctl/hdparm output: http://pastesite.com/328/123 )
Complete bootup dmesg output:
The system runs without problems but after a couple of days or weeks of
heavy IO load one of the following two situations can occur.
a) Error during access to a disk, leading to a step-by-step
downgrade of the DMA/PIO mode for the disk.
ata2.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2 frozen
ata2.00: tag 0 cmd 0x61 Emask 0x44 stat 0x40 err 0x0 (timeout)
ata2: soft resetting port
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
ata2.00: configured for UDMA/100
ata2.00: configured for UDMA/66
ata2.00: configured for PIO4
All of that happens within one second (according to the syslog timestamps).
Somewhere in between, raid1 gives up and drops the disk from the array.
But the disk remains accessible after the degradation and can still be
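
To make the step-down in case a) easier to spot in a long syslog, the
"configured for" lines can be pulled out per device. This is only a rough
sketch: the function name transfer_mode_history is made up for illustration,
and the sample lines are copied from the excerpt above.

```python
import re

# Matches libata lines such as "ata2.00: configured for UDMA/100".
MODE_LINE = re.compile(r"(ata\d+\.\d+): configured for (\S+)")

def transfer_mode_history(log_lines):
    """Return {device: [modes in order of appearance]}, skipping repeats."""
    history = {}
    for line in log_lines:
        match = MODE_LINE.search(line)
        if not match:
            continue
        device, mode = match.groups()
        modes = history.setdefault(device, [])
        # Record a mode only when it differs from the previous one.
        if not modes or modes[-1] != mode:
            modes.append(mode)
    return history

# Sample lines taken from the case a) excerpt above.
sample = [
    "ata2.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2 frozen",
    "ata2.00: configured for UDMA/133",
    "ata2: EH complete",
    "ata2.00: configured for UDMA/100",
    "ata2.00: configured for UDMA/66",
    "ata2.00: configured for PIO4",
]
print(transfer_mode_history(sample))
```

Because the pattern is searched anywhere in the line, it should also work on
lines that still carry their syslog timestamp and host prefix.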
b) Error during access to a disk, but the DMA/PIO mode is not changed. The
disk becomes totally inaccessible until the system is rebooted.
ata1.00: exception Emask 0x40 SAct 0xf SErr 0x800 action 0x2 frozen
ata1.00: tag 0 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
ata1.00: tag 1 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
ata1.00: tag 2 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
ata1.00: tag 3 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0xc SErr 0x0 action 0x2 frozen
ata1.00: tag 2 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1.00: tag 3 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: soft resetting port
sda, sector 362827470
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 362827470
After that the disk can't be accessed in any way.
# smartctl -d ata -a /dev/sda
Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)
# hdparm -I /dev/sda
HDIO_DRIVE_CMD(identify) failed: Input/output error
# hdparm -w /dev/sda
HDIO_DRIVE_RESET failed: Inappropriate ioctl for device
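
For anyone reading the libata messages quoted above, here is a small sketch
of how I understand some of the fields. It assumes that SAct is the bitmask
of outstanding NCQ tags and that the SStatus value is printed in hexadecimal
with the DET, SPD and IPM fields in the three lowest nibbles. The helper
names are made up for illustration.

```python
def active_tags(sact):
    """List the NCQ tag numbers whose bits are set in an SAct bitmask."""
    return [tag for tag in range(32) if sact & (1 << tag)]

def decode_sstatus(sstatus):
    """Split an SStatus value into its DET, SPD and IPM nibbles."""
    return {
        "DET": sstatus & 0xF,         # device detection state
        "SPD": (sstatus >> 4) & 0xF,  # negotiated interface speed
        "IPM": (sstatus >> 8) & 0xF,  # interface power management state
    }

# ATA opcodes seen in the "cmd" field of the excerpts above.
NCQ_COMMANDS = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}

print(active_tags(0xF))       # first exception in case b)
print(active_tags(0xC))       # second exception in case b)
print(decode_sstatus(0x123))  # "SStatus 123", assumed to be hex
print(NCQ_COMMANDS[0x60])
```

Read this way, SAct 0xf matches the four timed-out tags 0 to 3 in the first
exception of case b), and SAct 0xc matches tags 2 and 3 in the second one.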
full error output:
Has anyone experienced similar problems in the past?
I noticed that the SB600 driver has undergone some modifications/patches
since 2.6.18. Does anyone know if such driver updates/fixes are usually
backported into the Debian stock kernel?