SMART issues (Was: Lost interrupt, page allocation failure, and kernel oops)

Another (and hopefully final) discovery: having played around with
smartmontools, I now think that SMART first has to be enabled on
a device before it records errors. I have done this and restarted
the cp process (the one that initially resulted in the
crash).  And indeed,

  smartctl -a /dev/...

returns errors for both disks connected to my AEC IDE card; I have
attached the output of 'smartctl -a' for /dev/hde2 and /dev/hdf2
below.  The (I think) corresponding entries in /var/log/messages
for hdf are:

  Apr  1 20:45:41 bumbum kernel: hdf: dma_intr: status=0x51
    { DriveReady SeekComplete Error }
  Apr  1 20:45:41 bumbum kernel: hdf: dma_intr: error=0x84
    { DriveStatusError BadCRC }

(And this is repeated many, many times.) For hde2, the entries
are identical.
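
For reference, the status and error bytes in the kernel messages above
are bit fields. The following is a minimal Python sketch (not kernel
code) that decodes them. The bit-to-name tables are written from the
ATA register layout and the names the Linux IDE layer prints; they are
an assumption of this example, and the helper name decode is made up
for illustration.

```python
# Bit tables are assumptions of this sketch, based on the ATA status and
# error registers and the labels seen in the kernel messages.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "BadCRC"),               # ICRC: interface CRC error on a UDMA transfer
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x04, "DriveStatusError"),     # ABRT: command aborted
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x84, ERROR_BITS))   # ['BadCRC', 'DriveStatusError']
```

The second list comes out in bit order, so the two names appear in the
opposite order from the kernel message, but they are the same two bits.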

The main question I now have is: does a SMART error always mean
hardware failure, or might it be caused by a software problem?
If it can only be hardware failure, then we have solved the
problem (with all your kind help!) and the "fault" is not
with Linux/GNU!

On the other hand, these two disks are rather new (3 months)
and have *NOT AT ALL* been used heavily so far: they are
mere backup disks, I have written data to them only once
(which resulted in the original crash), and the rest of the
time they have been sitting idle (with very rare exceptions,
maybe). What I want to ask is this: why exactly these two disks,
both of which are connected to the same IDE card? Is the driver
maybe doing something wrong? Recall that I am using
the AEC62xx driver with Thibaut Varene's AEC6280 patch.


I am very grateful for any additional information you can
give me!

Thanks and have a nice day,

P.S. Here's some more info:

  bumbum:/var/log# cat /proc/ide/aec62xx
  Controller: 0
  Chipset: AEC865
--------------- Primary Channel ---------------- Secondary Channel -------------
                   enabled                          enabled
--------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
  DMA enabled:    yes              yes             no                no
  DMA Mode:       UDMA(5)          UDMA(5)         PIO(?)            PIO(?)

and the excerpt from dmesg:

AEC6280R: IDE controller at PCI slot 0000:01:02.0
AEC6280R: chipset revision 7
AEC6280R: ROM enabled at 0x80890000
AEC6280R: 100% native mode on irq 23
    ide2: BM-DMA at 0x1400-0x1407, BIOS settings: hde:pio, hdf:pio
    ide3: BM-DMA at 0x1408-0x140f, BIOS settings: hdg:pio, hdh:pio
Probing IDE interface ide2...
hde: Maxtor 6Y120P0, ATA DISK drive
hdf: Maxtor 6Y120P0, ATA DISK drive
ide2 at 0x14b0-0x14b7,0x14a2 on irq 23
Probing IDE interface ide3...
CMD646: IDE controller at PCI slot 0000:01:01.0
CMD646: chipset revision 7
CMD646: chipset revision 0x07, UltraDMA Capable
CMD646: 100% native mode on irq 26
    ide1: BM-DMA at 0x14c0-0x14c7, BIOS settings: hdc:pio, hdd:pio
    ide4: BM-DMA at 0x14c8-0x14cf, BIOS settings: hdi:pio, hdj:pio
Probing IDE interface ide1...
hdd: Maxtor 6Y120L0, ATA DISK drive
Unhandled interrupt 1a, disabled
ide1 at 0x1800-0x1807,0x14f2 on irq 26
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
hdb: max request size: 128KiB
hdb: 241254720 sectors (123522 MB) w/1863KiB Cache, CHS=65535/16/63, (U)DMA
/dev/ide/host0/bus0/target1/lun0: [mac] p1 p2
hde: max request size: 128KiB
hde: 240121728 sectors (122942 MB) w/7936KiB Cache, CHS=65535/16/63, UDMA(100)
/dev/ide/host2/bus0/target0/lun0: [mac] p1 p2
hdf: max request size: 128KiB
hdf: 240121728 sectors (122942 MB) w/7936KiB Cache, CHS=65535/16/63, UDMA(100)
/dev/ide/host2/bus0/target1/lun0: [mac] p1 p2
hdc: max request size: 128KiB
hdc: 78177792 sectors (40027 MB) w/1902KiB Cache, CHS=65535/16/63, UDMA(33)
/dev/ide/host1/bus0/target0/lun0: [mac] p1 p2 p3 p4
hdd: max request size: 128KiB
hdd: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(33)
/dev/ide/host1/bus0/target1/lun0: [mac] p1 p2

Finally the smartctl output(s):

************** BEGIN smartctl -a /dev/hde2 *********************
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device Model:     Maxtor 6Y120P0
Serial Number:    Y3Q0BALE
Firmware Version: YAR41BW0
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sat Apr  1 20:15:40 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 244)	Self-test routine in progress...
					40% of test remaining.
Total time to complete Offline
data collection: 		 ( 242) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  54) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   204   204   063    Pre-fail  Always       -       4476
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       38302
  9 Power_On_Minutes        0x0032   247   247   000    Old_age   Always       -       989h+22m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       36
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       1545
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       1
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       9
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   196   187   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1574 hours (65 days + 14 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 18 40 00 c8 e7  Error: ICRC, ABRT at LBA = 0x07c80040 = 130547776

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 18 40 00 c8 e7 08  35d+03:28:04.832  WRITE DMA
  ca 00 18 40 00 c4 e7 08  35d+03:28:04.832  WRITE DMA
  ca 00 08 40 00 c0 e7 08  35d+03:28:04.832  WRITE DMA
  ca 00 20 40 00 bc e7 08  35d+03:28:04.816  WRITE DMA
  ca 00 18 40 00 b8 e7 08  35d+03:28:04.816  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1617         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
************** END smartctl -a /dev/hde2 *********************
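
The LBA that smartctl prints in the error log above can be recomputed
from the register dump (ER ST SC SN CL CH DH, as named in the legend).
A minimal Python sketch, assuming 28-bit LBA addressing in which the
low four bits of DH carry LBA bits 24-27; the helper name
lba28_from_registers is made up for this illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA task-file registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27 (assumed LBA28 layout).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register bytes from the hde error entry: ER ST SC SN CL CH DH
dump = "84 51 18 40 00 c8 e7"
er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in dump.split())

lba = lba28_from_registers(sn, cl, ch, dh)
print(hex(lba), lba)  # 0x7c80040 130547776
```

The result agrees with the "LBA = 0x07c80040 = 130547776" that smartctl
reports for that entry.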

************** BEGIN smartctl -a /dev/hdf2 *********************
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device Model:     Maxtor 6Y120P0
Serial Number:    Y3Q0BARE
Firmware Version: YAR41BW0
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Apr  2 16:02:56 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 244)	Self-test routine in progress...
					40% of test remaining.
Total time to complete Offline
data collection: 		 ( 242) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  54) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   204   204   063    Pre-fail  Always       -       4480
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       43683
  9 Power_On_Minutes        0x0032   247   247   000    Old_age   Always       -       1007h+04m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       25
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       1439
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   111   111   000    Old_age   Offline      -       226
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       1
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   195   188   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
Warning: ATA error count 226 inconsistent with error log pointer 5

ATA Error Count: 226 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 226 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 f8 ad b5 f0  Error: ICRC, ABRT at LBA = 0x00b5adf8 = 11906552

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 f8 ad b5 f0 08   1d+09:34:29.072  WRITE DMA
  ca 00 00 f8 ac b5 f0 08   1d+09:34:29.072  WRITE DMA
  ca 00 00 f8 ab b5 f0 08   1d+09:34:29.056  WRITE DMA
  ca 00 00 f8 aa b5 f0 08   1d+09:34:29.056  WRITE DMA
  ca 00 00 f8 a9 b5 f0 08   1d+09:34:29.056  WRITE DMA

Error 225 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 e0 00 f9 f0  Error: ICRC, ABRT at LBA = 0x00f900e0 = 16318688

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 e0 00 f9 f0 08   1d+09:31:08.864  WRITE DMA
  ca 00 00 e0 ff f8 f0 08   1d+09:31:08.816  WRITE DMA
  ca 00 00 e0 fe f8 f0 08   1d+09:31:08.816  WRITE DMA
  ca 00 d8 00 fe f8 f0 08   1d+09:31:08.816  WRITE DMA
  ca 00 00 00 fd f8 f0 08   1d+09:31:08.816  WRITE DMA

Error 224 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 f0 00 e3 f0  Error: ICRC, ABRT at LBA = 0x00e300f0 = 14876912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 f0 00 e3 f0 08   1d+09:30:35.104  WRITE DMA
  ca 00 00 f0 ff e2 f0 08   1d+09:30:35.072  WRITE DMA
  ca 00 00 f0 fe e2 f0 08   1d+09:30:35.056  WRITE DMA
  ca 00 00 f0 fd e2 f0 08   1d+09:30:35.056  WRITE DMA
  ca 00 00 f0 fc e2 f0 08   1d+09:30:35.056  WRITE DMA

Error 223 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 c0 08 43 c6 f0  Error: ICRC, ABRT at LBA = 0x00c64308 = 12993288

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 c8 42 c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 41 c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 40 c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 3f c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 3e c6 f0 08   1d+09:28:45.008  WRITE DMA

Error 222 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 18 b3 f0  Error: ICRC, ABRT at LBA = 0x00b31800 = 11737088

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 00 18 b3 f0 08   1d+09:29:39.808  WRITE DMA
  c8 00 08 40 00 c0 f0 08   1d+09:29:39.808  READ DMA
  ca 00 00 00 17 b3 f0 08   1d+09:29:39.776  WRITE DMA
  ca 00 00 00 17 b3 f0 08   1d+09:29:39.680  WRITE DMA
  ca 00 08 f8 16 b3 f0 08   1d+09:29:39.648  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1952         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
************** END smartctl -a /dev/hdf2 *********************
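
One way to look at the two attribute tables side by side is to separate
the counters that track sector problems from the one that tracks
transfer CRC errors. The following is only a rough Python sketch of
that comparison, not a diagnosis. The grouping of attribute IDs
(5, 197, 198 as sector-related; 199 as transfer-related) and the helper
name summarize are assumptions of this example; the rows are copied
from the two dumps above.

```python
# Attribute rows copied from the hde and hdf dumps (whitespace-separated).
SAMPLE = {
    "hde": """\
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 0
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0008 199 199 000 Old_age Offline - 1
""",
    "hdf": """\
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 0
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0008 111 111 000 Old_age Offline - 226
""",
}

# Assumed grouping for this sketch only.
SECTOR_IDS = {5, 197, 198}   # reallocated / pending / uncorrectable sectors
TRANSFER_IDS = {199}         # CRC errors seen on UDMA transfers


def summarize(text):
    """Return (sector_total, transfer_total) of raw values in `text`."""
    sector_total = transfer_total = 0
    for line in text.splitlines():
        fields = line.split()
        attr_id, raw = int(fields[0]), int(fields[9])
        if attr_id in SECTOR_IDS:
            sector_total += raw
        elif attr_id in TRANSFER_IDS:
            transfer_total += raw
    return sector_total, transfer_total


for disk, text in SAMPLE.items():
    print(disk, summarize(text))
# hde (0, 1)
# hdf (0, 226)
```

For both disks the sector-related counters are zero, while the CRC
counter on hdf matches the ATA error count of 226 in its error log.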

