
Bug#625922: SATA devices get reset without real hardware failure



I can confirm the same problem.

cat /var/log/messages.0 |grep ata
Aug 28 00:11:45 lrdlnx kernel: ata2: hard resetting link
Aug 28 00:11:45 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 00:11:45 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 00:11:45 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 00:11:45 lrdlnx kernel: ata2: EH complete
Aug 28 00:31:24 lrdlnx kernel: ata2: hard resetting link
Aug 28 00:31:24 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 00:31:24 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 00:31:24 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 00:31:24 lrdlnx kernel: ata2: EH complete
Aug 28 01:02:13 lrdlnx clamd[4832]: SelfCheck: Database status OK.
Aug 28 02:39:01 lrdlnx freshclam[4935]: Database updated (1029731 signatures) from db.local.clamav.net (IP: 85.254.217.235)
Aug 28 02:50:15 lrdlnx kernel: ata2: hard resetting link
Aug 28 02:50:15 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 02:50:15 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 02:50:15 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 02:50:15 lrdlnx kernel: ata2: EH complete
Aug 28 03:02:07 lrdlnx clamd[4832]: SelfCheck: Database modification detected. Forcing reload.
Aug 28 03:02:08 lrdlnx clamd[4832]: Reading databases from /var/lib/clamav
Aug 28 03:02:18 lrdlnx clamd[4832]: Database correctly reloaded (1028330 signatures)
Aug 28 03:08:55 lrdlnx kernel: ata2: hard resetting link
Aug 28 03:08:55 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 03:08:56 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 03:08:56 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 03:08:56 lrdlnx kernel: ata2: EH complete
Aug 28 03:08:58 lrdlnx kernel: ata2: hard resetting link
Aug 28 03:08:58 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 03:08:58 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 03:08:58 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 03:08:58 lrdlnx kernel: ata2: EH complete
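Note that a plain `grep ata` also matches unrelated lines, such as the clamd "Database" messages in the output above. A minimal sketch of a tighter filter, assuming the syslog layout shown in this mail (the function name `ata_kernel_lines` is made up for illustration):

```shell
# Keep only kernel messages about an ATA port (ataN) or device (ataN.MM).
# Reads syslog text on stdin; the function name is hypothetical.
ata_kernel_lines() {
    grep -E 'kernel: ata[0-9]+(\.[0-9]+)?: '
}

# Example:
#   ata_kernel_lines < /var/log/messages.0
```

Continuation lines that the kernel prints without an `ataN:` prefix are dropped by this pattern, so it is only meant as a quick overview of the resets.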

After 5 PM there are no errors in /var/log/messages. Sometimes the error shows
up in the log once every few minutes, sometimes only after hours
or even days. The system is running 24/7.
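To put numbers on how often the resets happen, something like the following could be used. It is only a sketch: it assumes the "Mon DD HH:MM:SS host kernel: ataN: ..." layout from the log above, and the function name `resets_per_hour` is hypothetical.

```shell
# Count hard resets per hour and per port.
# Reads syslog text on stdin. Fields: $1=month $2=day $3=time $6=port ("ata2:").
# Matches both "hard resetting link" and the older "hard resetting port" wording.
resets_per_hour() {
    awk '
        / kernel: ata[0-9]+: hard resetting (link|port)/ {
            split($3, t, ":")                      # t[1] is the hour
            key = $1 " " $2 " " t[1] ":00 " $6
            if (!(key in n)) order[++k] = key      # remember first-seen order
            n[key]++
        }
        END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }
    '
}

# Example:
#   resets_per_hour < /var/log/messages.0
```

Each output line has the form "Aug 28 03:00 ata2: 2", i.e. date, hour bucket, port and number of resets in that hour.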

Around the time I started noticing the errors, I had just replaced the smaller
drives with 2TB Western Digital Caviar Green WD20EARS drives,
which use "IntelliPower" (as far as I understand, a variable spin rate of 5400-7200 rpm).

Just to be sure, I have already replaced the SATA cables with new ones.

The SATA controller is Nvidia:
root@lrdlnx:~# lspci |grep -i sata
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
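The "nv:" prefix in the kernel messages suggests the ports are handled by an Nvidia-specific libata driver (presumably sata_nv, but that is an assumption, not something shown in this mail). A small sketch for checking which driver is actually bound; the helper name `sata_pci_addrs` is made up:

```shell
# Print the PCI address (first column) of every line that mentions SATA.
# Reads `lspci` output on stdin; the function name is hypothetical.
sata_pci_addrs() {
    awk 'tolower($0) ~ /sata/ { print $1 }'
}

# Example (needs pciutils; -k shows the kernel driver in use, -s selects one device):
#   lspci | sata_pci_addrs | while read -r addr; do lspci -k -s "$addr"; done
```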


My RAID setup:
root@lrdlnx:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda5[2] sdb5[1]
      1857650986 blocks super 1.2 [2/2] [UU]
     
md1 : active raid1 sdb2[1] sda2[0]
      70011200 blocks [2/2] [UU]
     
md3 : active raid1 sdd1[1] sdc1[0]
      730957376 blocks [2/2] [UU]
     
md0 : active raid1 sdb1[1] sda1[0]
      136448 blocks [2/2] [UU]
     
unused devices: <none>
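All four arrays show [UU], i.e. both mirror halves are present. For watching this over time, a rough sketch that lists only arrays with a missing member could look like this (the function name `degraded_md_arrays` is hypothetical, and it assumes the mdstat layout shown above):

```shell
# Print the name of every md array whose member map (e.g. [UU]) contains "_",
# which marks a missing or failed member. Reads /proc/mdstat text on stdin.
degraded_md_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                       # remember current array
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }   # e.g. [U_] or [_U]
    '
}

# Example:
#   degraded_md_arrays < /proc/mdstat
```

With the output above it prints nothing, since no array is degraded.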

I have run tests a few times with no errors. The only thing is that I get these
errors in the log, but everything is working perfectly:

root@lrdlnx:~# badblocks -vv /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test):
done                               
Pass completed, 0 bad blocks found.
root@lrdlnx:~# badblocks -vv /dev/sdb
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test):
done                               
Pass completed, 0 bad blocks found.
root@lrdlnx:~# badblocks -vv /dev/sdc
Checking blocks 0 to 732574583
Checking for bad blocks (read-only test):
done                               
Pass completed, 0 bad blocks found.
root@lrdlnx:~# badblocks -vv /dev/sdd
Checking blocks 0 to 732574583
Checking for bad blocks (read-only test):
done                               
Pass completed, 0 bad blocks found.
root@lrdlnx:~#

                                                                            

root@lrdlnx:~# smartctl -t short /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:21:57 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -t short /dev/sdb
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:22:02 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -t short /dev/sdc
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:22:05 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -t short /dev/sdd
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:22:08 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -l selftest /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description     Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1 Short offline        Completed without error        00%     1109        -
# 2 Short offline        Completed without error        00%     1104        -
# 3 Short offline        Completed without error        00%     1080        -
# 4 Short offline        Completed without error        00%     1057        -
# 5 Short offline        Completed without error        00%     1033        -
# 6 Short offline        Completed without error        00%     1009        -
# 7 Short offline        Completed without error        00%      985        -
# 8 Short offline        Completed without error        00%      961        -
# 9 Short offline        Completed without error        00%      937        -
#10 Short offline        Completed without error        00%      913        -
#11 Short offline        Completed without error        00%      889        -
#12 Short offline        Completed without error        00%      865        -
#13 Short offline        Completed without error        00%      841        -
#14 Short offline        Completed without error        00%      817        -
#15 Short offline        Completed without error        00%      793        -
#16 Short offline        Completed without error        00%      770        -
#17 Short offline        Completed without error        00%      748        -
#18 Short offline        Completed without error        00%      724        -
#19 Short offline        Completed without error        00%      700        -
#20 Short offline        Completed without error        00%      676        -
#21 Short offline        Completed without error        00%      652        -
root@lrdlnx:~# smartctl -l selftest /dev/sdb
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description     Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1 Short offline        Completed without error        00%     1116        -
# 2 Short offline        Completed without error        00%     1111        -
# 3 Short offline        Completed without error        00%     1087        -
# 4 Short offline        Completed without error        00%     1063        -
# 5 Short offline        Completed without error        00%     1039        -
# 6 Short offline        Completed without error        00%     1015        -
# 7 Short offline        Completed without error        00%      991        -
# 8 Short offline        Completed without error        00%      967        -
# 9 Short offline        Completed without error        00%      943        -
#10 Short offline        Completed without error        00%      919        -
#11 Short offline        Completed without error        00%      895        -
#12 Short offline        Completed without error        00%      871        -
#13 Short offline        Completed without error        00%      847        -
#14 Short offline        Completed without error        00%      823        -
#15 Short offline        Completed without error        00%      800        -
#16 Short offline        Completed without error        00%      776        -
#17 Short offline        Completed without error        00%      754        -
#18 Short offline        Completed without error        00%      730        -
#19 Short offline        Completed without error        00%      706        -
#20 Short offline        Completed without error        00%      682        -
#21 Short offline        Completed without error        00%      658        -
root@lrdlnx:~# smartctl -l selftest /dev/sdc
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description     Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1 Short offline        Completed without error       00%    16121        -
# 2 Short offline        Completed without error       00%    16116        -
# 3 Short offline        Completed without error       00%    16092        -
# 4 Short offline        Completed without error       00%    16068        -
# 5 Short offline        Completed without error       00%    16044        -
# 6 Short offline        Completed without error       00%    16020        -
# 7 Short offline        Completed without error       00%    15996        -
# 8 Short offline        Completed without error       00%    15972        -
# 9 Short offline        Completed without error       00%    15948        -
#10 Short offline        Completed without error       00%    15924        -
#11 Short offline        Completed without error       00%    15900        -
#12 Short offline        Completed without error       00%    15876        -
#13 Short offline        Completed without error       00%    15852        -
#14 Short offline        Completed without error       00%    15828        -
#15 Short offline        Completed without error       00%    15804        -
#16 Short offline        Completed without error       00%    15780        -
#17 Short offline        Completed without error       00%    15758        -
#18 Short offline        Completed without error       00%    15734        -
#19 Short offline        Completed without error       00%    15710        -
#20 Short offline        Completed without error       00%    15686        -
#21 Short offline        Completed without error       00%    15662        -
root@lrdlnx:~# smartctl -l selftest /dev/sdd
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description     Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1 Short offline        Completed without error       00%    16122        -
# 2 Short offline        Completed without error       00%    16117        -
# 3 Short offline        Completed without error       00%    16093        -
# 4 Short offline        Completed without error       00%    16069        -
# 5 Short offline        Completed without error       00%    16045        -
# 6 Short offline        Completed without error       00%    16021        -
# 7 Short offline        Completed without error       00%    15997        -
# 8 Short offline        Completed without error       00%    15973        -
# 9 Short offline        Completed without error       00%    15949        -
#10 Short offline        Completed without error       00%    15925        -
#11 Short offline        Completed without error       00%    15901        -
#12 Short offline        Completed without error       00%    15877        -
#13 Short offline        Completed without error       00%    15853        -
#14 Short offline        Completed without error       00%    15829        -
#15 Short offline        Completed without error       00%    15805        -
#16 Short offline        Completed without error       00%    15781        -
#17 Short offline        Completed without error       00%    15759        -
#18 Short offline        Completed without error       00%    15735        -
#19 Short offline        Completed without error       00%    15711        -
#20 Short offline        Completed without error       00%    15687        -
#21 Short offline        Completed without error       00%    15663        -
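
With this many self-test entries per drive it is easy to overlook a bad one. A minimal sketch that prints only the entries whose status is something other than "Completed without error" (the function name `failed_selftests` is hypothetical; it assumes log rows start with "#" and the test number, as in the output above):

```shell
# Print self-test log rows that did not complete without error.
# Reads `smartctl -l selftest` output on stdin; prints nothing if all rows are clean.
# Note: a test that is still running is also reported, since its status differs.
failed_selftests() {
    awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}

# Example (device names as used in this mail):
#   for d in sda sdb sdc sdd; do smartctl -l selftest /dev/$d | failed_selftests; done
```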




These errors just make me worried, because the last time I had a real HDD failure
I saw similar port reset errors,
but also actual errors on the drive, such as I/O errors and read failures:

Apr 16 21:44:19 lrd-selleri kernel:          res 40/00:00:00:00:e0/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Apr 16 21:44:19 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:44:19 lrd-selleri kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 16 21:44:19 lrd-selleri kernel: ata1.00: configured for UDMA/133
Apr 16 21:44:19 lrd-selleri kernel: ata1: EH complete
Apr 16 21:44:19 lrd-selleri kernel: sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
Apr 16 21:44:19 lrd-selleri kernel: sd 0:0:0:0: [sda] Write Protect is off
Apr 16 21:44:19 lrd-selleri kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 16 21:50:32 lrd-selleri kernel:          res 40/00:00:00:00:e0/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:32 lrd-selleri kernel: ata1: port is slow to respond, please be patient (Status 0x80)
Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:32 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)
Apr 16 21:50:32 lrd-selleri kernel: ata1: failed to recover some devices, retrying in 5 secs
Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:32 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)
Apr 16 21:50:33 lrd-selleri kernel: ata1.00: limiting speed to UDMA/133:PIO3
Apr 16 21:50:33 lrd-selleri kernel: ata1: failed to recover some devices, retrying in 5 secs
Apr 16 21:50:33 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:33 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)
Apr 16 21:50:33 lrd-selleri kernel: ata1.00: disabled
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Apr 16 21:50:33 lrd-selleri kernel: Descriptor sense data with sense descriptors (in hex):
Apr 16 21:50:33 lrd-selleri kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Apr 16 21:50:33 lrd-selleri kernel:         00 00 00 00 
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Apr 16 21:50:33 lrd-selleri kernel: end_request: I/O error, dev sda, sector 272308480
Apr 16 21:50:33 lrd-selleri kernel: md: super_written gets error=-5, uptodate=0
Apr 16 21:50:33 lrd-selleri kernel: ^IOperation continuing on 1 devices
Apr 16 21:50:33 lrd-selleri kernel: ata1: EH complete
Apr 16 21:50:33 lrd-selleri kernel: ata1.00: detaching (SCSI 0:0:0:0)
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Stopping disk
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] START_STOP FAILED
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel:  --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel:  disk 0, wo:0, o:1, dev:sdb2
Apr 16 21:50:33 lrd-selleri kernel:  disk 1, wo:1, o:0, dev:sda2
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel:  --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel:  disk 0, wo:0, o:1, dev:sdb2
Apr 16 21:50:33 lrd-selleri kernel: ^IOperation continuing on 1 devices
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel:  --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel:  disk 0, wo:0, o:1, dev:sdb1
Apr 16 21:50:33 lrd-selleri kernel:  disk 1, wo:1, o:0, dev:sda1
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel:  --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel:  disk 0, wo:0, o:1, dev:sdb1
Apr 16 21:50:33 lrd-selleri kernel:  to dead device
Apr 16 21:50:33 lrd-selleri kernel: ^IOperation continuing on 1 devices
Apr 16 21:50:34 lrd-selleri kernel:  to dead device
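
The difference between the two logs in this mail can be turned into a rough check: the harmless-looking resets end with the link coming back up, while the real failure also shows "SATA link down", a disabled device and block-layer I/O errors. The sketch below is only a heuristic derived from these two logs, and the function name `has_real_failure` is made up:

```shell
# Exit status 0 if the syslog text on stdin shows signs of a real device failure
# (link down, device disabled, or block-layer I/O error), non-zero otherwise.
has_real_failure() {
    grep -E -q 'SATA link down|ata[0-9]+\.[0-9]+: disabled|I/O error, dev '
}

# Example:
#   if has_real_failure < /var/log/messages.0; then echo "check the drive"; fi
```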


-- 
-------------------------
Juhani Karlsson
juhani dot karlsson at iki dot fi
http://lrdlnx.iki.fi
-------------------------




