SMART issues (Was: Lost interrupt, page allocation failure, and kernel oops)
Another (and hopefully final) discovery: having played around with
smartmontools, I now think that SMART first has to be enabled on
a device before it records errors. I have done this and restarted
the cp process (the one which initially resulted in the
crash).  And indeed,
  smartctl -a /dev/...
returns errors for both disks connected to my AEC IDE card; I have
attached the output for /dev/hde2 and /dev/hdf2 below.  The (I think)
corresponding entries in /var/log/messages for hdf are:
  Apr  1 20:45:41 bumbum kernel: hdf: dma_intr: status=0x51
    { DriveReady SeekComplete Error }
  Apr  1 20:45:41 bumbum kernel: hdf: dma_intr: error=0x84
    { DriveStatusError BadCRC }
(And this is repeated many, many times.) For hde2, the entries
are identical.
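For reference, the status and error bytes in these messages are bit fields. Here is a minimal Python sketch of how they decode. It assumes the classic ATA status/error register layout and the labels the Linux IDE driver prints; the names for bits that do not show up in my log are from memory, and the helper name decode_bits is made up for illustration:

```python
# Bit labels for the ATA status register, highest bit first.
# Assumption: classic ATA task-file layout with Linux IDE driver naming.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Bit labels for the ATA error register, listed in the order seen in the
# kernel message above (DriveStatusError before BadCRC).
ERROR_BITS = [
    (0x04, "DriveStatusError"),    # ABRT: command aborted
    (0x80, "BadCRC"),              # ICRC: interface CRC error on a UDMA transfer
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode_bits(value, table):
    """Return the labels of all bits set in value, in table order."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    print("status=0x51 {", " ".join(decode_bits(0x51, STATUS_BITS)), "}")
    print("error=0x84 {", " ".join(decode_bits(0x84, ERROR_BITS)), "}")
```

With the values from my log, 0x51 decodes to DriveReady, SeekComplete and Error, and 0x84 to DriveStatusError and BadCRC, matching what the kernel printed.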
The main question I now have is: does a SMART error always mean
hardware failure, or might it be caused by a software problem?
If it can only be hardware failure then we have solved the
problem --- with all your kind help! --- and the "fault" is not
with Linux/GNU!
On the other hand, these two disks are rather new (3 months)
and have *NOT AT ALL* been used heavily so far: that is, they are
mere backup disks and I have written data to them only once
(which resulted in the original crash) and the rest of the
time they have been running idle (with very rare exceptions
maybe). What I want to say is this: why is it exactly these two
disks, both connected to the same IDE card? Is the driver
maybe doing something wrong? Recall here that I am using
the AEC62xx driver with Thibaut Varene's AEC6280 patch,
  http://marc.theaimsgroup.com/?l=linux-ide&m=113128446708744&q=p3
I am very grateful for any additional information you can
give me!
Thanks and a nice day,
Kaspar
P.S. Here's some more info:
  bumbum:/var/log# cat /proc/ide/aec62xx
  Controller: 0
  Chipset: AEC865
  --------------- Primary Channel ---------------- Secondary Channel -------------
                   enabled                          enabled
  --------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
  DMA enabled:    yes              yes             no                no
  DMA Mode:       UDMA(5)          UDMA(5)         PIO(?)            PIO(?)
and the excerpt from dmesg:
AEC6280R: IDE controller at PCI slot 0000:01:02.0
AEC6280R: chipset revision 7
AEC6280R: ROM enabled at 0x80890000
AEC6280R: 100% native mode on irq 23
    ide2: BM-DMA at 0x1400-0x1407, BIOS settings: hde:pio, hdf:pio
    ide3: BM-DMA at 0x1408-0x140f, BIOS settings: hdg:pio, hdh:pio
Probing IDE interface ide2...
hde: Maxtor 6Y120P0, ATA DISK drive
hdf: Maxtor 6Y120P0, ATA DISK drive
ide2 at 0x14b0-0x14b7,0x14a2 on irq 23
Probing IDE interface ide3...
CMD646: IDE controller at PCI slot 0000:01:01.0
CMD646: chipset revision 7
CMD646: chipset revision 0x07, UltraDMA Capable
CMD646: 100% native mode on irq 26
    ide1: BM-DMA at 0x14c0-0x14c7, BIOS settings: hdc:pio, hdd:pio
    ide4: BM-DMA at 0x14c8-0x14cf, BIOS settings: hdi:pio, hdj:pio
Probing IDE interface ide1...
hdc: QUANTUM FIREBALLP AS40.0, ATA DISK drive
hdd: Maxtor 6Y120L0, ATA DISK drive
Unhandled interrupt 1a, disabled
ide1 at 0x1800-0x1807,0x14f2 on irq 26
Probing IDE interface ide4...
ide4: Wait for ready failed before probe !
hdb: max request size: 128KiB
hdb: 241254720 sectors (123522 MB) w/1863KiB Cache, CHS=65535/16/63, (U)DMA
/dev/ide/host0/bus0/target1/lun0: [mac] p1 p2
hde: max request size: 128KiB
hde: 240121728 sectors (122942 MB) w/7936KiB Cache, CHS=65535/16/63, UDMA(100)
/dev/ide/host2/bus0/target0/lun0: [mac] p1 p2
hdf: max request size: 128KiB
hdf: 240121728 sectors (122942 MB) w/7936KiB Cache, CHS=65535/16/63, UDMA(100)
/dev/ide/host2/bus0/target1/lun0: [mac] p1 p2
hdc: max request size: 128KiB
hdc: 78177792 sectors (40027 MB) w/1902KiB Cache, CHS=65535/16/63, UDMA(33)
/dev/ide/host1/bus0/target0/lun0: [mac] p1 p2 p3 p4
hdd: max request size: 128KiB
hdd: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(33)
/dev/ide/host1/bus0/target1/lun0: [mac] p1 p2
Finally the smartctl output(s):
************** BEGIN smartctl -a /dev/hde2 *********************
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     Maxtor 6Y120P0
Serial Number:    Y3Q0BALE
Firmware Version: YAR41BW0
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sat Apr  1 20:15:40 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 244)	Self-test routine in progress...
					40% of test remaining.
Total time to complete Offline
data collection: 		 ( 242) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  54) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   204   204   063    Pre-fail  Always       -       4476
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       38302
  9 Power_On_Minutes        0x0032   247   247   000    Old_age   Always       -       989h+22m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       36
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       1545
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       1
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       9
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   196   187   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 1574 hours (65 days + 14 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 18 40 00 c8 e7  Error: ICRC, ABRT at LBA = 0x07c80040 = 130547776
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 18 40 00 c8 e7 08  35d+03:28:04.832  WRITE DMA
  ca 00 18 40 00 c4 e7 08  35d+03:28:04.832  WRITE DMA
  ca 00 08 40 00 c0 e7 08  35d+03:28:04.832  WRITE DMA
  ca 00 20 40 00 bc e7 08  35d+03:28:04.816  WRITE DMA
  ca 00 18 40 00 b8 e7 08  35d+03:28:04.816  WRITE DMA
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1617         -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
************** END smartctl -a /dev/hde2 *********************
************** BEGIN smartctl -a /dev/hdf2 *********************
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     Maxtor 6Y120P0
Serial Number:    Y3Q0BARE
Firmware Version: YAR41BW0
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Apr  2 16:02:56 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 244)	Self-test routine in progress...
					40% of test remaining.
Total time to complete Offline
data collection: 		 ( 242) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  54) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   204   204   063    Pre-fail  Always       -       4480
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       43683
  9 Power_On_Minutes        0x0032   247   247   000    Old_age   Always       -       1007h+04m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       25
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       1439
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   111   111   000    Old_age   Offline      -       226
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       1
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   195   188   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
Warning: ATA error count 226 inconsistent with error log pointer 5
ATA Error Count: 226 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 226 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 f8 ad b5 f0  Error: ICRC, ABRT at LBA = 0x00b5adf8 = 11906552
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 f8 ad b5 f0 08   1d+09:34:29.072  WRITE DMA
  ca 00 00 f8 ac b5 f0 08   1d+09:34:29.072  WRITE DMA
  ca 00 00 f8 ab b5 f0 08   1d+09:34:29.056  WRITE DMA
  ca 00 00 f8 aa b5 f0 08   1d+09:34:29.056  WRITE DMA
  ca 00 00 f8 a9 b5 f0 08   1d+09:34:29.056  WRITE DMA
Error 225 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 e0 00 f9 f0  Error: ICRC, ABRT at LBA = 0x00f900e0 = 16318688
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 e0 00 f9 f0 08   1d+09:31:08.864  WRITE DMA
  ca 00 00 e0 ff f8 f0 08   1d+09:31:08.816  WRITE DMA
  ca 00 00 e0 fe f8 f0 08   1d+09:31:08.816  WRITE DMA
  ca 00 d8 00 fe f8 f0 08   1d+09:31:08.816  WRITE DMA
  ca 00 00 00 fd f8 f0 08   1d+09:31:08.816  WRITE DMA
Error 224 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 f0 00 e3 f0  Error: ICRC, ABRT at LBA = 0x00e300f0 = 14876912
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 f0 00 e3 f0 08   1d+09:30:35.104  WRITE DMA
  ca 00 00 f0 ff e2 f0 08   1d+09:30:35.072  WRITE DMA
  ca 00 00 f0 fe e2 f0 08   1d+09:30:35.056  WRITE DMA
  ca 00 00 f0 fd e2 f0 08   1d+09:30:35.056  WRITE DMA
  ca 00 00 f0 fc e2 f0 08   1d+09:30:35.056  WRITE DMA
Error 223 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 c0 08 43 c6 f0  Error: ICRC, ABRT at LBA = 0x00c64308 = 12993288
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 c8 42 c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 41 c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 40 c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 3f c6 f0 08   1d+09:28:45.024  WRITE DMA
  ca 00 00 c8 3e c6 f0 08   1d+09:28:45.008  WRITE DMA
Error 222 occurred at disk power-on lifetime: 1950 hours (81 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 18 b3 f0  Error: ICRC, ABRT at LBA = 0x00b31800 = 11737088
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 00 18 b3 f0 08   1d+09:29:39.808  WRITE DMA
  c8 00 08 40 00 c0 f0 08   1d+09:29:39.808  READ DMA
  ca 00 00 00 17 b3 f0 08   1d+09:29:39.776  WRITE DMA
  ca 00 00 00 17 b3 f0 08   1d+09:29:39.680  WRITE DMA
  ca 00 08 f8 16 b3 f0 08   1d+09:29:39.648  WRITE DMA
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1952         -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
************** END smartctl -a /dev/hdf2 *********************
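As a cross-check on the error log entries above, the LBA that smartctl reports can be recomputed from the register dump. This is a small Python sketch, assuming 28-bit LBA addressing (the low four bits of DH hold LBA bits 24-27, which fits the WRITE DMA commands shown); the function name is illustrative only:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA task-file registers into a 28-bit LBA.

    sn = Sector Number (LBA bits 0-7)
    cl = Cylinder Low  (LBA bits 8-15)
    ch = Cylinder High (LBA bits 16-23)
    dh = Device/Head   (low nibble holds LBA bits 24-27)
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    # Error 1 on hde: SN=40 CL=00 CH=c8 DH=e7
    lba = lba28_from_registers(0x40, 0x00, 0xC8, 0xE7)
    print(f"hde error 1:   0x{lba:08x} = {lba}")
    # Error 226 on hdf: SN=f8 CL=ad CH=b5 DH=f0
    lba = lba28_from_registers(0xF8, 0xAD, 0xB5, 0xF0)
    print(f"hdf error 226: 0x{lba:08x} = {lba}")
```

Both results agree with the "at LBA = ..." values in the logs, so the register dumps and the reported addresses are at least consistent with each other.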