SMART issues (Was: Lost interrupt, page allocation failure, and kernel oops)
Another (and hopefully final) discovery: having played around with
smartmontools, I now think that SMART first has to be enabled on
a device before it records errors. I have done this and restarted
the cp process (the one that initially resulted in the
crash). And indeed,
smartctl -a /dev/...
returns errors for both disks connected to my AEC IDE card; I have
attached the output of 'smartctl -a' for /dev/hde2 and /dev/hdf2
below. The (I think) corresponding entries in /var/log/messages
for hdf2 are:
Apr 1 20:45:41 bumbum kernel: hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Apr 1 20:45:41 bumbum kernel: hdf: dma_intr: error=0x84 { DriveStatusError BadCRC }
(And this is repeated many, many times.) For hde2, the entries
are identical.
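For reference, the two register values in these log lines can be
decoded bit by bit. The following is only a minimal Python sketch,
assuming the standard ATA status/error register layout and the labels
the kernel prints in the lines above; the function names are my own
and purely illustrative.

```python
# Status register bits, labelled the way the kernel log prints them
# (assumed from the standard ATA status register layout).
STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of the bits set in an ATA status byte."""
    if value & 0x80:
        # While the drive is busy the other bits are not meaningful.
        return ["Busy"]
    return [name for bit, name in STATUS_BITS if value & bit]


def decode_error(value):
    """Return the names of the bits set in an ATA error byte."""
    names = []
    aborted = bool(value & 0x04)  # ABRT: command aborted
    if aborted:
        names.append("DriveStatusError")
    if value & 0x80:
        # Assumption: bit 7 together with ABRT is reported as a CRC
        # error, and as a bad sector otherwise.
        names.append("BadCRC" if aborted else "BadSector")
    if value & 0x40:
        names.append("UncorrectableError")
    if value & 0x10:
        names.append("SectorIdNotFound")
    if value & 0x02:
        names.append("TrackZeroNotFound")
    if value & 0x01:
        names.append("AddrMarkNotFound")
    return names


print(decode_status(0x51))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode_error(0x84))   # ['DriveStatusError', 'BadCRC']
```

Bits 0x80 and 0x04 of the error register are what smartctl reports as
ICRC and ABRT in its error log.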
The main question I now have is: does a SMART error always mean a
hardware failure, or might it be caused by a software problem?
If it can only be a hardware failure, then we have solved the
problem (with all your kind help!) and the "fault" is not
with Linux/GNU!
On the other hand, these two disks are rather new (3 months)
and have *NOT AT ALL* been used heavily so far: they are
mere backup disks, I have written data to them only once
(which resulted in the original crash), and the rest of the
time they have been sitting idle (with very rare exceptions,
maybe). What I want to say is this: why exactly these two disks,
both of which are connected to the same IDE card? Is the driver
maybe doing something wrong? Recall that I am using
the AEC62xx driver with Thibaut Varene's AEC6280 patch,
http://marc.theaimsgroup.com/?l=linux-ide&m=113128446708744&q=p3
I am very grateful for any additional information you can
give me!
Thanks and a nice day,
Kaspar
P.S. Here's some more info:
bumbum:/var/log# cat /proc/ide/aec62xx
Controller: 0
Chipset: AEC865
--------------- Primary Channel ---------------- Secondary Channel -------------
                enabled                          enabled
--------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
DMA enabled:    yes              yes             no                no
DMA Mode:       UDMA(5)          UDMA(5)         PIO(?)            PIO(?)
and the excerpt from dmesg:
AEC6280R: IDE controller at PCI slot 0000:01:02.0
AEC6280R: chipset revision 7
AEC6280R: ROM enabled at 0x80890000
AEC6280R: 100% native mode on irq 23
ide2: BM-DMA at 0x1400-0x1407, BIOS settings: hde:pio, hdf:pio
ide3: BM-DMA at 0x1408-0x140f, BIOS settings: hdg:pio, hdh:pio
Probing IDE interface ide2...
hde: Maxtor 6Y120P0, ATA DISK drive
hdf: Maxtor 6Y120P0, ATA DISK drive
ide2 at 0x14b0-0x14b7,0x14a2 on irq 23
Probing IDE interface ide3...
CMD646: IDE controller at PCI slot 0000:01:01.0
CMD646: chipset revision 7
CMD646: chipset revision 0x07, UltraDMA Capable
CMD646: 100% native mode on irq 26
ide1: BM-DMA at 0x14c0-0x14c7, BIOS settings: hdc:pio, hdd:pio
ide4: BM-DMA at 0x14c8-0x14cf, BIOS settings: hdi:pio, hdj:pio
Probing IDE interface ide1...
hdc: QUANTUM FIREBALLP AS40.0, ATA DISK drive
hdd: Maxtor 6Y120L0, ATA DISK drive
Unhandled interrupt 1a, disabled
ide1 at 0x1800-0x1807,0x14f2 on irq 26
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
hdb: max request size: 128KiB
hdb: 241254720 sectors (123522 MB) w/1863KiB Cache, CHS=65535/16/63, (U)DMA
/dev/ide/host0/bus0/target1/lun0: [mac] p1 p2
hde: max request size: 128KiB
hde: 240121728 sectors (122942 MB) w/7936KiB Cache, CHS=65535/16/63, UDMA(100)
/dev/ide/host2/bus0/target0/lun0: [mac] p1 p2
hdf: max request size: 128KiB
hdf: 240121728 sectors (122942 MB) w/7936KiB Cache, CHS=65535/16/63, UDMA(100)
/dev/ide/host2/bus0/target1/lun0: [mac] p1 p2
hdc: max request size: 128KiB
hdc: 78177792 sectors (40027 MB) w/1902KiB Cache, CHS=65535/16/63, UDMA(33)
/dev/ide/host1/bus0/target0/lun0: [mac] p1 p2 p3 p4
hdd: max request size: 128KiB
hdd: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(33)
/dev/ide/host1/bus0/target1/lun0: [mac] p1 p2
Finally the smartctl output(s):
************** BEGIN smartctl -a /dev/hde2 *********************
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: Maxtor 6Y120P0
Serial Number: Y3Q0BALE
Firmware Version: YAR41BW0
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Sat Apr 1 20:15:40 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 244) Self-test routine in progress...
40% of test remaining.
Total time to complete Offline
data collection: ( 242) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 54) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   204   204   063    Pre-fail  Always       -       4476
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       38302
  9 Power_On_Minutes        0x0032   247   247   000    Old_age   Always       -       989h+22m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       36
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       1545
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       1
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       9
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   196   187   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 1574 hours (65 days + 14 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 18 40 00 c8 e7 Error: ICRC, ABRT at LBA = 0x07c80040 = 130547776
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 18 40 00 c8 e7 08 35d+03:28:04.832 WRITE DMA
ca 00 18 40 00 c4 e7 08 35d+03:28:04.832 WRITE DMA
ca 00 08 40 00 c0 e7 08 35d+03:28:04.832 WRITE DMA
ca 00 20 40 00 bc e7 08 35d+03:28:04.816 WRITE DMA
ca 00 18 40 00 b8 e7 08 35d+03:28:04.816 WRITE DMA
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1617         -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
************** END smartctl -a /dev/hde2 *********************
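The LBA that smartctl prints for the hde error can be recomputed from
the register dump. This is just a small sketch, assuming 28-bit LBA
addressing (the low four bits of DH carry LBA bits 24-27, followed by
CH, CL and SN); the helper name lba28 is made up for illustration.

```python
def lba28(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the hde error entry: SN=40 CL=00 CH=c8 DH=e7
lba = lba28(0x40, 0x00, 0xC8, 0xE7)
print(hex(lba), lba)  # 0x7c80040 130547776
```

The result agrees with the "LBA = 0x07c80040 = 130547776" that smartctl
reports for this error.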
************** BEGIN smartctl -a /dev/hdf2 *********************
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: Maxtor 6Y120P0
Serial Number: Y3Q0BARE
Firmware Version: YAR41BW0
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Sun Apr 2 16:02:56 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 244) Self-test routine in progress...
40% of test remaining.
Total time to complete Offline
data collection: ( 242) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 54) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   204   204   063    Pre-fail  Always       -       4480
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       43683
  9 Power_On_Minutes        0x0032   247   247   000    Old_age   Always       -       1007h+04m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       25
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       1439
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   111   111   000    Old_age   Offline      -       226
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       1
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   195   188   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
Warning: ATA error count 226 inconsistent with error log pointer 5
ATA Error Count: 226 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 226 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 f8 ad b5 f0 Error: ICRC, ABRT at LBA = 0x00b5adf8 = 11906552
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 00 f8 ad b5 f0 08 1d+09:34:29.072 WRITE DMA
ca 00 00 f8 ac b5 f0 08 1d+09:34:29.072 WRITE DMA
ca 00 00 f8 ab b5 f0 08 1d+09:34:29.056 WRITE DMA
ca 00 00 f8 aa b5 f0 08 1d+09:34:29.056 WRITE DMA
ca 00 00 f8 a9 b5 f0 08 1d+09:34:29.056 WRITE DMA
Error 225 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 e0 00 f9 f0 Error: ICRC, ABRT at LBA = 0x00f900e0 = 16318688
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 00 e0 00 f9 f0 08 1d+09:31:08.864 WRITE DMA
ca 00 00 e0 ff f8 f0 08 1d+09:31:08.816 WRITE DMA
ca 00 00 e0 fe f8 f0 08 1d+09:31:08.816 WRITE DMA
ca 00 d8 00 fe f8 f0 08 1d+09:31:08.816 WRITE DMA
ca 00 00 00 fd f8 f0 08 1d+09:31:08.816 WRITE DMA
Error 224 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 f0 00 e3 f0 Error: ICRC, ABRT at LBA = 0x00e300f0 = 14876912
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 00 f0 00 e3 f0 08 1d+09:30:35.104 WRITE DMA
ca 00 00 f0 ff e2 f0 08 1d+09:30:35.072 WRITE DMA
ca 00 00 f0 fe e2 f0 08 1d+09:30:35.056 WRITE DMA
ca 00 00 f0 fd e2 f0 08 1d+09:30:35.056 WRITE DMA
ca 00 00 f0 fc e2 f0 08 1d+09:30:35.056 WRITE DMA
Error 223 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 c0 08 43 c6 f0 Error: ICRC, ABRT at LBA = 0x00c64308 = 12993288
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 00 c8 42 c6 f0 08 1d+09:28:45.024 WRITE DMA
ca 00 00 c8 41 c6 f0 08 1d+09:28:45.024 WRITE DMA
ca 00 00 c8 40 c6 f0 08 1d+09:28:45.024 WRITE DMA
ca 00 00 c8 3f c6 f0 08 1d+09:28:45.024 WRITE DMA
ca 00 00 c8 3e c6 f0 08 1d+09:28:45.008 WRITE DMA
Error 222 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 18 b3 f0 Error: ICRC, ABRT at LBA = 0x00b31800 = 11737088
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 00 00 18 b3 f0 08 1d+09:29:39.808 WRITE DMA
c8 00 08 40 00 c0 f0 08 1d+09:29:39.808 READ DMA
ca 00 00 00 17 b3 f0 08 1d+09:29:39.776 WRITE DMA
ca 00 00 00 17 b3 f0 08 1d+09:29:39.680 WRITE DMA
ca 00 08 f8 16 b3 f0 08 1d+09:29:39.648 WRITE DMA
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1952         -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
************** END smartctl -a /dev/hdf2 *********************
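To compare the two drives at a glance, the raw UDMA_CRC_Error_Count
values can be pulled out of the attribute tables. Here is a rough
Python sketch, assuming each attribute row sits on a single line with
whitespace-separated columns; the two sample rows are copied from the
output above, and parse_attributes is a hypothetical helper, not part
of smartmontools.

```python
def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: ID, NAME, FLAG, VALUE, WORST, THRESH,
        # TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x")):
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


# Attribute 199 rows from the hde and hdf outputs.
HDE = "199 UDMA_CRC_Error_Count 0x0008 199 199 000 Old_age Offline - 1"
HDF = "199 UDMA_CRC_Error_Count 0x0008 111 111 000 Old_age Offline - 226"

for disk, sample in (("hde", HDE), ("hdf", HDF)):
    name, raw = parse_attributes(sample)[199]
    print(disk, name, raw)
```

For hdf, the raw value of 226 matches the ATA Error Count of 226 in its
error log; for hde, both counts are 1.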