[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

hard reset sata link when scrubbing array



Hi,

when scrubbing an md array yesterday (do not know who triggered that yet but 
that's something I hope to cover in another thread...) I got lots of these:

Dec  2 01:38:45 xerxes kernel: [4942064.241510] ata6: hard resetting link
Dec  2 01:38:45 xerxes kernel: [4942064.731991] ata6: SATA link up 1.5 Gbps 
(SStatus 113 SControl 310)
Dec  2 01:38:45 xerxes kernel: [4942064.743759] ata6.00: configured for 
UDMA/33
Dec  2 01:38:45 xerxes kernel: [4942064.755973] ata6: EH complete
Dec  2 01:39:11 xerxes kernel: [4942090.846069] ata6: hard resetting link
Dec  2 01:39:12 xerxes kernel: [4942091.334346] ata6: SATA link up 1.5 Gbps 
(SStatus 113 SControl 310)
Dec  2 01:39:12 xerxes kernel: [4942091.346117] ata6.00: configured for 
UDMA/33
Dec  2 01:39:12 xerxes kernel: [4942091.358342] ata6: EH complete
Dec  2 01:39:37 xerxes kernel: [4942116.461263] ata6: hard resetting link
Dec  2 01:39:37 xerxes kernel: [4942116.949209] ata6: SATA link up 1.5 Gbps 
(SStatus 113 SControl 310)
Dec  2 01:39:37 xerxes kernel: [4942116.961046] ata6.00: configured for 
UDMA/33
Dec  2 01:39:37 xerxes kernel: [4942116.973225] ata6: EH complete
Dec  2 01:39:45 xerxes kernel: [4942124.317211] ata6: hard resetting link
Dec  2 01:39:45 xerxes kernel: [4942124.805185] ata6: SATA link up 1.5 Gbps 
(SStatus 113 SControl 310)
Dec  2 01:39:45 xerxes kernel: [4942124.816985] ata6.00: configured for 
UDMA/33
Dec  2 01:39:45 xerxes kernel: [4942124.829171] ata6: EH complete
Dec  2 01:39:50 xerxes kernel: [4942129.661310] ata6: hard resetting link
Dec  2 01:39:51 xerxes kernel: [4942130.150457] ata6: SATA link up 1.5 Gbps 
(SStatus 113 SControl 310)
Dec  2 01:39:51 xerxes kernel: [4942130.162256] ata6.00: configured for 
UDMA/33
Dec  2 01:39:51 xerxes kernel: [4942130.174443] ata6: EH complete
Dec  2 01:39:55 xerxes kernel: [4942134.652582] ata6: hard resetting link

etc etc. Kept repeating until the check was over, roughly every 5-30 seconds. 
The array check completed fine, and when I force-fed the array some test data 
I saw no trace of link resets in messages.

I rather suspect the chipset or cabling to be the flaky part, when the system 
was all fresh I had quite a few issues of that kind until I pimped the 
chipset cooling. Cables are standard sATA cables like they come with any 
motherboard, 50cm in length. 

Then again that particular disk did log some errors in the SMART journal:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  
WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  
Always       -       4536
  2 Throughput_Performance  0x0026   053   053   000    Old_age   
Always       -       17969
  3 Spin_Up_Time            0x0023   061   060   025    Pre-fail  
Always       -       11994
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   
Always       -       6880
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  
Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   
Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   
Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   
Always       -       9953
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   
Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   
Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   
Always       -       470
191 G-Sense_Error_Rate      0x0022   091   091   000    Old_age   
Always       -       93015
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   
Always       -       0
194 Temperature_Celsius     0x0002   064   053   000    Old_age   
Always       -       26 (Min/Max 15/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   
Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   
Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   
Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   
Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   099   099   000    Old_age   
Always       -       760
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   
Always       -       307
223 Load_Retry_Count        0x0032   252   252   000    Old_age   
Always       -       0
225 Load_Cycle_Count        0x0032   097   097   000    Old_age   
Always       -       38029

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 2804 hours (116 days + 20 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 0a      00:01:24.159  IDENTIFY DEVICE
  27 00 00 00 00 00 e0 0a      00:01:24.159  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a      00:01:24.159  IDENTIFY DEVICE
  00 00 00 00 00 00 00 0a      00:01:24.159  NOP [Abort queued commands]
  00 00 30 bf fc 3e 40 0a      00:01:24.158  NOP [Abort queued commands]

Error 4 occurred at disk power-on lifetime: 2804 hours (116 days + 20 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 0a      00:01:24.159  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 0a      00:01:24.159  IDENTIFY DEVICE
  00 00 00 00 00 00 00 0a      00:01:24.159  NOP [Abort queued commands]
  00 00 30 bf fc 3e 40 0a      00:01:24.158  NOP [Abort queued commands]
  60 00 70 4f 94 7e 40 0a      00:01:22.644  READ FPDMA QUEUED

Error 3 occurred at disk power-on lifetime: 2658 hours (110 days + 18 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 0a      00:00:08.945  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 0a      00:00:08.945  IDENTIFY DEVICE
  00 00 01 01 00 00 40 0a      00:00:08.945  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:08.945  NOP [Abort queued commands]
  25 00 08 3f 00 30 e0 0a      00:00:08.944  READ DMA EXT

Error 2 occurred at disk power-on lifetime: 2356 hours (98 days + 4 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 0a      00:00:27.230  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 0a      00:00:27.230  IDENTIFY DEVICE
  00 08 00 00 00 00 00 08      00:00:27.230  NOP [Reserved subcommand]
  00 08 0c d7 37 ba e0 08      00:00:27.229  NOP [Reserved subcommand]
  60 08 10 2f 74 eb 40 08      00:00:27.161  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 2099 hours (87 days + 11 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 0a      00:00:04.659  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 0a      00:00:04.659  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a      00:00:04.659  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a      00:00:04.659  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a      00:00:04.659  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  
LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8084         -
# 2  Short offline       Completed without error       00%      5062         -
# 3  Short offline       Completed without error       00%       457         -



Well - what do you make of this? In normal operations the link resets do not 
seem to occur at all, even under stress tests.


Dex


-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS d--(+)@ s-:+ a C++++ UL++ P+>++ L+++>++++ E-- W++ N o? K-
w--(---) !O M+ V- PS+ PE Y++ PGP t++(---)@ 5 X+(++) R+(++) tv--(+)@ 
b++(+++) DI+++ D- G++ e* h>++ r* y?
------END GEEK CODE BLOCK------

Attachment: signature.asc
Description: This is a digitally signed message part.


Reply to: