Bug#625922: SATA devices get reset without real hardware failure
I can confirm the same problem.
cat /var/log/messages.0 |grep ata
Aug 28 00:11:45 lrdlnx kernel: ata2: hard resetting link
Aug 28 00:11:45 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 00:11:45 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 00:11:45 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 00:11:45 lrdlnx kernel: ata2: EH complete
Aug 28 00:31:24 lrdlnx kernel: ata2: hard resetting link
Aug 28 00:31:24 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 00:31:24 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 00:31:24 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 00:31:24 lrdlnx kernel: ata2: EH complete
Aug 28 01:02:13 lrdlnx clamd[4832]: SelfCheck: Database status OK.
Aug 28 02:39:01 lrdlnx freshclam[4935]: Database updated (1029731 signatures) from db.local.clamav.net (IP: 85.254.217.235)
Aug 28 02:50:15 lrdlnx kernel: ata2: hard resetting link
Aug 28 02:50:15 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 02:50:15 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 02:50:15 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 02:50:15 lrdlnx kernel: ata2: EH complete
Aug 28 03:02:07 lrdlnx clamd[4832]: SelfCheck: Database modification detected. Forcing reload.
Aug 28 03:02:08 lrdlnx clamd[4832]: Reading databases from /var/lib/clamav
Aug 28 03:02:18 lrdlnx clamd[4832]: Database correctly reloaded (1028330 signatures)
Aug 28 03:08:55 lrdlnx kernel: ata2: hard resetting link
Aug 28 03:08:55 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 03:08:56 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 03:08:56 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 03:08:56 lrdlnx kernel: ata2: EH complete
Aug 28 03:08:58 lrdlnx kernel: ata2: hard resetting link
Aug 28 03:08:58 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port
Aug 28 03:08:58 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 28 03:08:58 lrdlnx kernel: ata2.00: configured for UDMA/133
Aug 28 03:08:58 lrdlnx kernel: ata2: EH complete
After 5PM there are no errors in /var/log/messages. Sometimes the error shows up in
the log once every few minutes, sometimes only after hours
or even days. The system is running 24/7.
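To put numbers on how irregular the resets are, the excerpt above can be summarised per port. This is only a rough Python sketch and was not run as part of this report. The name `reset_gaps`, the regular expression, and the default year are my own assumptions (syslog timestamps carry no year, and `%b` assumes an English locale).

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches syslog lines such as:
#   Aug 28 00:11:45 lrdlnx kernel: ata2: hard resetting link
RESET_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (ata\d+): hard resetting (?:link|port)"
)

def reset_gaps(lines, year=2011):
    """Return {port: [seconds between consecutive hard resets]}.

    The year is an assumption: syslog lines do not record it.
    """
    stamps = defaultdict(list)
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            stamps[m.group(2)].append(when)
    return {
        port: [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
        for port, ts in stamps.items()
    }

# The five reset lines from the excerpt above.
SAMPLE = [
    "Aug 28 00:11:45 lrdlnx kernel: ata2: hard resetting link",
    "Aug 28 00:31:24 lrdlnx kernel: ata2: hard resetting link",
    "Aug 28 02:50:15 lrdlnx kernel: ata2: hard resetting link",
    "Aug 28 03:08:55 lrdlnx kernel: ata2: hard resetting link",
    "Aug 28 03:08:58 lrdlnx kernel: ata2: hard resetting link",
]

if __name__ == "__main__":
    print(reset_gaps(SAMPLE))  # {'ata2': [1179, 8331, 1120, 3]}
```

For the excerpt this gives gaps of roughly 20 minutes, 2.3 hours, 19 minutes and 3 seconds on ata2, which matches the "no fixed interval" impression.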
Around the time I started noticing the errors I had just replaced smaller
drives with 2TB Western Digital Caviar Green WD20EARS drives,
which use "IntelliPower", a variable spin rate of 5400-7200rpm.
Just to be sure, I have already replaced the SATA cables with new ones.
SATA is Nvidia:
root@lrdlnx:~# lspci |grep -i sata
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
my raid:
root@lrdlnx:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda5[2] sdb5[1]
1857650986 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
70011200 blocks [2/2] [UU]
md3 : active raid1 sdd1[1] sdc1[0]
730957376 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
136448 blocks [2/2] [UU]
unused devices: <none>
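A quick way to keep an eye on this while the resets continue would be to check the member map at the end of each status line, where `[UU]` means both members are up and a `_` marks a missing one. The following Python sketch is only an illustration. The function name `degraded_arrays` is hypothetical, and it handles only the simple layout shown above.

```python
import re

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member map contains '_'."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+) : ", line)
        if m:
            current = m.group(1)
            continue
        # The status line ends with a map such as [UU] (all up) or [U_] (one missing).
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

# The /proc/mdstat output quoted above.
SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 sda5[2] sdb5[1]
1857650986 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
70011200 blocks [2/2] [UU]
md3 : active raid1 sdd1[1] sdc1[0]
730957376 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
136448 blocks [2/2] [UU]
unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))  # []
```

On the output above it returns an empty list, i.e. none of the four arrays has lost a member despite the resets.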
I have run tests a few times with no errors. The only thing is that I see these
errors, but everything is working perfectly:
root@lrdlnx:~# badblocks -vv /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test):
done
Pass completed, 0 bad blocks found.
root@lrdlnx:~# badblocks -vv /dev/sdb
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test):
done
Pass completed, 0 bad blocks found.
root@lrdlnx:~# badblocks -vv /dev/sdc
Checking blocks 0 to 732574583
Checking for bad blocks (read-only test):
done
Pass completed, 0 bad blocks found.
root@lrdlnx:~# badblocks -vv /dev/sdd
Checking blocks 0 to 732574583
Checking for bad blocks (read-only test):
done
Pass completed, 0 bad blocks found.
root@lrdlnx:~#
root@lrdlnx:~# smartctl -t short /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:21:57 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -t short /dev/sdb
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:22:02 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -t short /dev/sdc
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:22:05 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -t short /dev/sdd
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 08:22:08 2011
Use smartctl -X to abort test.
root@lrdlnx:~# smartctl -l selftest /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1109         -
# 2  Short offline       Completed without error       00%      1104         -
# 3  Short offline       Completed without error       00%      1080         -
# 4  Short offline       Completed without error       00%      1057         -
# 5  Short offline       Completed without error       00%      1033         -
# 6  Short offline       Completed without error       00%      1009         -
# 7  Short offline       Completed without error       00%       985         -
# 8  Short offline       Completed without error       00%       961         -
# 9  Short offline       Completed without error       00%       937         -
#10  Short offline       Completed without error       00%       913         -
#11  Short offline       Completed without error       00%       889         -
#12  Short offline       Completed without error       00%       865         -
#13  Short offline       Completed without error       00%       841         -
#14  Short offline       Completed without error       00%       817         -
#15  Short offline       Completed without error       00%       793         -
#16  Short offline       Completed without error       00%       770         -
#17  Short offline       Completed without error       00%       748         -
#18  Short offline       Completed without error       00%       724         -
#19  Short offline       Completed without error       00%       700         -
#20  Short offline       Completed without error       00%       676         -
#21  Short offline       Completed without error       00%       652         -
root@lrdlnx:~# smartctl -l selftest /dev/sdb
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1116         -
# 2  Short offline       Completed without error       00%      1111         -
# 3  Short offline       Completed without error       00%      1087         -
# 4  Short offline       Completed without error       00%      1063         -
# 5  Short offline       Completed without error       00%      1039         -
# 6  Short offline       Completed without error       00%      1015         -
# 7  Short offline       Completed without error       00%       991         -
# 8  Short offline       Completed without error       00%       967         -
# 9  Short offline       Completed without error       00%       943         -
#10  Short offline       Completed without error       00%       919         -
#11  Short offline       Completed without error       00%       895         -
#12  Short offline       Completed without error       00%       871         -
#13  Short offline       Completed without error       00%       847         -
#14  Short offline       Completed without error       00%       823         -
#15  Short offline       Completed without error       00%       800         -
#16  Short offline       Completed without error       00%       776         -
#17  Short offline       Completed without error       00%       754         -
#18  Short offline       Completed without error       00%       730         -
#19  Short offline       Completed without error       00%       706         -
#20  Short offline       Completed without error       00%       682         -
#21  Short offline       Completed without error       00%       658         -
root@lrdlnx:~# smartctl -l selftest /dev/sdc
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16121         -
# 2  Short offline       Completed without error       00%     16116         -
# 3  Short offline       Completed without error       00%     16092         -
# 4  Short offline       Completed without error       00%     16068         -
# 5  Short offline       Completed without error       00%     16044         -
# 6  Short offline       Completed without error       00%     16020         -
# 7  Short offline       Completed without error       00%     15996         -
# 8  Short offline       Completed without error       00%     15972         -
# 9  Short offline       Completed without error       00%     15948         -
#10  Short offline       Completed without error       00%     15924         -
#11  Short offline       Completed without error       00%     15900         -
#12  Short offline       Completed without error       00%     15876         -
#13  Short offline       Completed without error       00%     15852         -
#14  Short offline       Completed without error       00%     15828         -
#15  Short offline       Completed without error       00%     15804         -
#16  Short offline       Completed without error       00%     15780         -
#17  Short offline       Completed without error       00%     15758         -
#18  Short offline       Completed without error       00%     15734         -
#19  Short offline       Completed without error       00%     15710         -
#20  Short offline       Completed without error       00%     15686         -
#21  Short offline       Completed without error       00%     15662         -
root@lrdlnx:~# smartctl -l selftest /dev/sdd
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16122         -
# 2  Short offline       Completed without error       00%     16117         -
# 3  Short offline       Completed without error       00%     16093         -
# 4  Short offline       Completed without error       00%     16069         -
# 5  Short offline       Completed without error       00%     16045         -
# 6  Short offline       Completed without error       00%     16021         -
# 7  Short offline       Completed without error       00%     15997         -
# 8  Short offline       Completed without error       00%     15973         -
# 9  Short offline       Completed without error       00%     15949         -
#10  Short offline       Completed without error       00%     15925         -
#11  Short offline       Completed without error       00%     15901         -
#12  Short offline       Completed without error       00%     15877         -
#13  Short offline       Completed without error       00%     15853         -
#14  Short offline       Completed without error       00%     15829         -
#15  Short offline       Completed without error       00%     15805         -
#16  Short offline       Completed without error       00%     15781         -
#17  Short offline       Completed without error       00%     15759         -
#18  Short offline       Completed without error       00%     15735         -
#19  Short offline       Completed without error       00%     15711         -
#20  Short offline       Completed without error       00%     15687         -
#21  Short offline       Completed without error       00%     15663         -
These errors just make me worried, because the last time I had a real HDD failure
I saw similar port reset errors,
but also actual errors on the drive, such as I/O errors and read failures:
Apr 16 21:44:19 lrd-selleri kernel: res 40/00:00:00:00:e0/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Apr 16 21:44:19 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:44:19 lrd-selleri kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 16 21:44:19 lrd-selleri kernel: ata1.00: configured for UDMA/133
Apr 16 21:44:19 lrd-selleri kernel: ata1: EH complete
Apr 16 21:44:19 lrd-selleri kernel: sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
Apr 16 21:44:19 lrd-selleri kernel: sd 0:0:0:0: [sda] Write Protect is off
Apr 16 21:44:19 lrd-selleri kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 16 21:50:32 lrd-selleri kernel: res 40/00:00:00:00:e0/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:32 lrd-selleri kernel: ata1: port is slow to respond, please be patient (Status 0x80)
Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:32 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)
Apr 16 21:50:32 lrd-selleri kernel: ata1: failed to recover some devices, retrying in 5 secs
Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:32 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)
Apr 16 21:50:33 lrd-selleri kernel: ata1.00: limiting speed to UDMA/133:PIO3
Apr 16 21:50:33 lrd-selleri kernel: ata1: failed to recover some devices, retrying in 5 secs
Apr 16 21:50:33 lrd-selleri kernel: ata1: hard resetting port
Apr 16 21:50:33 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)
Apr 16 21:50:33 lrd-selleri kernel: ata1.00: disabled
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Apr 16 21:50:33 lrd-selleri kernel: Descriptor sense data with sense descriptors (in hex):
Apr 16 21:50:33 lrd-selleri kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 16 21:50:33 lrd-selleri kernel: 00 00 00 00
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Apr 16 21:50:33 lrd-selleri kernel: end_request: I/O error, dev sda, sector 272308480
Apr 16 21:50:33 lrd-selleri kernel: md: super_written gets error=-5, uptodate=0
Apr 16 21:50:33 lrd-selleri kernel: ^IOperation continuing on 1 devices
Apr 16 21:50:33 lrd-selleri kernel: ata1: EH complete
Apr 16 21:50:33 lrd-selleri kernel: ata1.00: detaching (SCSI 0:0:0:0)
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Stopping disk
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] START_STOP FAILED
Apr 16 21:50:33 lrd-selleri kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel: --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel: disk 0, wo:0, o:1, dev:sdb2
Apr 16 21:50:33 lrd-selleri kernel: disk 1, wo:1, o:0, dev:sda2
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel: --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel: disk 0, wo:0, o:1, dev:sdb2
Apr 16 21:50:33 lrd-selleri kernel: ^IOperation continuing on 1 devices
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel: --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel: disk 0, wo:0, o:1, dev:sdb1
Apr 16 21:50:33 lrd-selleri kernel: disk 1, wo:1, o:0, dev:sda1
Apr 16 21:50:33 lrd-selleri kernel: RAID1 conf printout:
Apr 16 21:50:33 lrd-selleri kernel: --- wd:1 rd:2
Apr 16 21:50:33 lrd-selleri kernel: disk 0, wo:0, o:1, dev:sdb1
Apr 16 21:50:33 lrd-selleri kernel: to dead device
Apr 16 21:50:33 lrd-selleri kernel: ^IOperation continuing on 1 devices
Apr 16 21:50:34 lrd-selleri kernel: to dead device
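The difference between the two logs can be stated mechanically: the April failure contains messages that never appear in the August resets. Below is a minimal Python sketch of that comparison. The marker list and the name `classify` are my own illustrative choices taken from the two excerpts in this mail; they are not a kernel-defined list and certainly not exhaustive.

```python
# Substrings that appear in the April log (real failure) but not in the
# August log (resets only). Illustrative, not exhaustive.
FAILURE_MARKERS = (
    "SATA link down",
    "failed to recover",
    "I/O error",
    "ATA bus error",
)

def classify(lines):
    """Return 'failure', 'reset-only' or 'clean' for a list of kernel log lines."""
    saw_reset = False
    for line in lines:
        if any(marker in line for marker in FAILURE_MARKERS):
            return "failure"
        if "hard resetting" in line:
            saw_reset = True
    return "reset-only" if saw_reset else "clean"

# One reset cycle from the August log.
AUGUST = [
    "Aug 28 00:11:45 lrdlnx kernel: ata2: hard resetting link",
    "Aug 28 00:11:45 lrdlnx kernel: ata2: nv: skipping hardreset on occupied port",
    "Aug 28 00:11:45 lrdlnx kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "Aug 28 00:11:45 lrdlnx kernel: ata2.00: configured for UDMA/133",
    "Aug 28 00:11:45 lrdlnx kernel: ata2: EH complete",
]

# A few lines from the April log.
APRIL = [
    "Apr 16 21:50:32 lrd-selleri kernel: ata1: hard resetting port",
    "Apr 16 21:50:32 lrd-selleri kernel: ata1: SATA link down (SStatus 0 SControl 300)",
    "Apr 16 21:50:33 lrd-selleri kernel: end_request: I/O error, dev sda, sector 272308480",
]

if __name__ == "__main__":
    print(classify(AUGUST))  # reset-only
    print(classify(APRIL))   # failure
```

By this crude test the current messages fall in the "reset-only" group, which is why I think they are not caused by a real hardware failure.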
--
-------------------------
Juhani Karlsson
juhani dot karlsson at iki dot fi
http://lrdlnx.iki.fi
-------------------------