
(S)ATA exceptions, disk frozen



Hi,

We have about 60 Opteron servers running sarge amd64 with a vanilla
2.6.20.3 kernel (they previously ran 2.6.18.1; I installed the newer
kernel in the hope that the problem would disappear).

About once or twice a week, the SATA disk in one of the servers freezes
and I have to reboot that machine. Because the freezes appear to occur at
random (and since smartctl -a doesn't report any errors), I conclude that
this is a software problem rather than a hardware issue.
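
To put the rate in per-machine terms, here is a rough back-of-the-envelope sketch. It assumes 1.5 freezes per week as the midpoint of "1-2", that all 60 servers are equally likely to be hit, and that freezes are independent (a Poisson model); none of these assumptions has been checked against the actual per-host history, and the function names are only illustrative.

```python
import math

def per_server_weekly_rate(freezes_per_week, servers):
    """Average freezes per server per week, assuming all servers are alike."""
    return freezes_per_week / servers

def prob_at_least_one(rate_per_week, weeks):
    """Poisson probability that one given server freezes at least once."""
    return 1.0 - math.exp(-rate_per_week * weeks)

if __name__ == "__main__":
    rate = per_server_weekly_rate(1.5, 60)  # 1.5 is an assumed midpoint
    print("freezes per server per week:", rate)
    print("mean weeks between freezes on one server:", 1.0 / rate)
    print("P(a given server freezes within 12 weeks):",
          round(prob_at_least_one(rate, 12), 3))
```

If the freezes per host turned out to be concentrated on a few machines rather than spread roughly as this model predicts, that would point more towards individual hardware than towards the kernel.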

Any idea how to narrow down the problem?

Thanks, Thomas

-------------------------------------------------------------------------

Disk: Western Digital SATA 250GB
Device Model:     WDC WD2500YS-01SHB0
Firmware Version: 20.06C03

The board is a TYAN S3993 Thunder h2000M with ServerWorks BCM5780 (HT2000)
chipset.

Here is a typical error log:

Mar 29 09:07:24 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
Mar 29 09:07:24 ata1.00: (BMDMA stat 0x61)
Mar 29 09:07:24 ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
Mar 29 09:07:31 ata1: port is slow to respond, please be patient
Mar 29 09:07:54 ata1: port failed to respond (30 secs)
Mar 29 09:07:54 ata1: soft resetting port
Mar 29 09:08:01 ata1: port is slow to respond, please be patient
Mar 29 09:08:24 ata1: port failed to respond (30 secs)
Mar 29 09:08:25 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 29 09:08:25 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:25 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:54 ata1.00: qc timeout (cmd 0xec)
Mar 29 09:08:54 ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 29 09:08:54 ata1.00: revalidation failed (errno=-5)
Mar 29 09:08:55 ata1: failed to recover some devices, retrying in 5 secs
Mar 29 09:08:59 ata1: hard resetting port
Mar 29 09:09:07 ata1: port is slow to respond, please be patient
Mar 29 09:09:30 ata1: port failed to respond (30 secs)
Mar 29 09:09:31 ata1: COMRESET failed (device not ready)
Mar 29 09:09:31 ata1: hardreset failed, retrying in 5 secs
Mar 29 09:09:35 ata1: hard resetting port
Mar 29 09:09:42 ata1: port is slow to respond, please be patient
Mar 29 09:10:05 ata1: port failed to respond (30 secs)
Mar 29 09:10:05 ata1: COMRESET failed (device not ready)
Mar 29 09:10:05 ata1: hardreset failed, retrying in 5 secs
Mar 29 09:10:10 ata1: hard resetting port
Mar 29 09:10:18 ata1: port is slow to respond, please be patient
Mar 29 09:10:41 ata1: port failed to respond (30 secs)
Mar 29 09:10:41 ata1: COMRESET failed (device not ready)
Mar 29 09:10:41 ata1: reset failed, giving up
Mar 29 09:10:41 ata1.00: disabled
Mar 29 09:10:41 ata1: EH complete
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 8059914
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 14590970
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 823825
Mar 29 09:10:41 lost page write due to I/O error on sda2
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 14489178
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 811101
Mar 29 09:10:41 lost page write due to I/O error on sda2
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 14489762
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 811174
Mar 29 09:10:41 lost page write due to I/O error on sda2
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 811175
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:42 end_request: I/O error, dev sda, sector 14488442
Mar 29 09:10:42 Buffer I/O error on device sda2, logical block 811009
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:42 end_request: I/O error, dev sda, sector 14561994
Mar 29 09:10:42 Buffer I/O error on device sda2, logical block 820203
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:42 end_request: I/O error, dev sda, sector 8060026
Mar 29 09:10:42 Buffer I/O error on device sda2, logical block 7457
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 Aborting journal on device sda2.
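
For reading the hex fields in the log above, here is a small sketch that decodes the ATA command opcodes and status bytes. The status bit names are the traditional ATA ones (some are obsolete in newer revisions of the standard), the command table covers only the two opcodes seen in this log, and the helper names are made up for illustration. BMDMA status values are deliberately skipped, since they use a different bit layout.

```python
import re

# Traditional ATA status register bits, highest bit first.
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "DSC"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR"),
]

# Only the opcodes that appear in the log above.
KNOWN_COMMANDS = {0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}

# Match "cmd 0x..", "stat 0x.." or "status 0x..", but not "BMDMA stat 0x..".
FIELD_RE = re.compile(r"(?<!BMDMA )\b(cmd|stat|status)\s+(0x[0-9A-Fa-f]+)")

def decode_status(value):
    """Return the names of the status bits set in value."""
    return [name for bit, name in STATUS_BITS if value & bit]

def describe(line):
    """Return readable notes for the cmd/stat/status fields in a log line."""
    notes = []
    for field, hex_value in FIELD_RE.findall(line):
        value = int(hex_value, 16)
        if field == "cmd":
            name = KNOWN_COMMANDS.get(value, "not in table")
            notes.append("cmd %s = %s" % (hex_value, name))
        else:
            bits = "|".join(decode_status(value)) or "none"
            notes.append("status %s = %s" % (hex_value, bits))
    return notes

if __name__ == "__main__":
    import sys
    # Usage: python decode_ata.py < /var/log/kern.log
    for log_line in sys.stdin:
        for note in describe(log_line):
            print(note)
```

Read this way, the command that timed out (0xca) is a DMA write, and the repeated status 0xD0 has the BSY bit set, which fits the later "device not ready" messages. That alone does not say whether the drive or the controller is at fault.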


