Troubleshooting occasional hdX lost interrupts - any suggestions?
Every few days, I get the kernel error "hdX: lost interrupt" where X is
usually c or g.
I'm having a hard time tracking down any systematic way of
troubleshooting this problem.
hdg is a brand new drive and ran for a couple of weeks in another system
without a blip, so I don't think it is a problem with the drive itself.
There are also no SMART errors appearing on any drives.
I have replaced the ribbon cable connecting the drive to the controller.
hdc and hdg, which both occasionally get lost interrupts, are on
different controllers--and, in fact, on different sorts of controllers.
One is a VIA vt8235 IDE UDMA133; the other is a Triones Technologies
HPT366/368/370/370A/372 RAID controller.
I was using the Debian stock kernel 2.6.8-2-k7; now I'm using a
custom-built vanilla 2.6.15.4. I haven't figured out whether there is a
real statistical difference in the number of errors between the two--I
may be getting them slightly more frequently with 2.6.15.4, but I don't
have enough data points to be sure.
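To get firmer numbers on the kernel comparison, the errors could be tallied from the saved kernel log. The following is only a rough sketch: it assumes the log still contains the "Linux version ..." line the kernel prints at each boot, and that the error lines contain the text "hdX: lost interrupt" as quoted above. The function name count_lost_interrupts and the log path in the usage comment are illustrative assumptions, not anything taken from this system.

```python
import re
from collections import Counter

# "hdc: lost interrupt" anywhere in a log line
LOST_RE = re.compile(r"\b(hd[a-z]): lost interrupt")
# Boot banner printed by the kernel, e.g. "Linux version 2.6.15.4 (...)"
BOOT_RE = re.compile(r"Linux version (\S+)")


def count_lost_interrupts(lines):
    """Count lost-interrupt messages per (kernel version, drive).

    Each error is attributed to the most recent boot banner seen
    before it; errors logged before any banner count as "unknown".
    """
    counts = Counter()
    kernel = "unknown"
    for line in lines:
        boot = BOOT_RE.search(line)
        if boot:
            kernel = boot.group(1)
            continue
        lost = LOST_RE.search(line)
        if lost:
            counts[(kernel, lost.group(1))] += 1
    return counts


# Usage (the path is an assumption; adjust to wherever syslog writes):
#   with open("/var/log/kern.log", errors="replace") as f:
#       for (kernel, drive), n in sorted(count_lost_interrupts(f).items()):
#           print(kernel, drive, n)
```

Raw counts are only comparable if each kernel ran for a similar length of time, so the totals would still need to be divided by days of uptime under each kernel.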
I also *seemed* to be getting them more frequently when I had a UPS
installed. Since I've taken the UPS out and connected the computer
directly to a power socket, they seem to be rarer and are not
accompanied by any DMA timeout errors, but again I'm not certain this is
statistically significant.
/proc/interrupts says:
           CPU0
  0:   32453965          XT-PIC  timer
  1:         16          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:          0          XT-PIC  uhci_hcd:usb2
  8:          4          XT-PIC  rtc
 10:    3554483          XT-PIC  ide2, ide3, uhci_hcd:usb3
 11:    9589616          XT-PIC  uhci_hcd:usb1, eth0, eth1
 12:          0          XT-PIC  ehci_hcd:usb4
 14:    2235942          XT-PIC  ide0
 15:    1836402          XT-PIC  ide1
NMI:          0
LOC:   32454287
ERR:      12990
MIS:          0
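Since hdg's channel shares IRQ 10 with another IDE channel and a USB controller, it may be worth pulling the shared lines out of this listing to see whether the affected drives correlate with sharing. The sketch below is a minimal parser, assuming the 2.6-era layout shown above (IRQ number, one count per CPU, controller type, then a comma-separated device list); the function name shared_irqs is hypothetical.

```python
def shared_irqs(text):
    """Return {irq: [devices]} for IRQs claimed by more than one device."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skip the CPU header and the NMI/LOC/ERR/MIS rows
        fields = rest.split()
        # Drop the per-CPU interrupt counts (one numeric field per CPU).
        while fields and fields[0].isdigit():
            fields.pop(0)
        # fields[0] is now the controller type (e.g. XT-PIC);
        # everything after it is the comma-separated device list.
        devices = [d.strip() for d in " ".join(fields[1:]).split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


# Usage:
#   with open("/proc/interrupts") as f:
#       for irq, devs in sorted(shared_irqs(f.read()).items()):
#           print(irq, ", ".join(devs))
```

Run against the listing above, this would flag IRQ 10 (ide2, ide3, uhci_hcd:usb3) and IRQ 11 (uhci_hcd:usb1, eth0, eth1). Note that hdc's channel, ide1 on IRQ 15, is not shared, so sharing alone would not explain both drives.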
/proc/ioports:
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial
0cf8-0cff : PCI conf1
4000-407f : 0000:00:11.0
5000-500f : 0000:00:11.0
c000-c0ff : 0000:00:0c.0
  c000-c0ff : r8169
c400-c4ff : 0000:00:0e.0
c800-c807 : 0000:00:0f.0
  c800-c807 : ide2
cc00-cc03 : 0000:00:0f.0
  cc02-cc02 : ide2
d000-d007 : 0000:00:0f.0
  d000-d007 : ide3
d400-d403 : 0000:00:0f.0
  d402-d402 : ide3
d800-d8ff : 0000:00:0f.0
  d800-d807 : ide2
  d808-d80f : ide3
  d810-d8ff : HPT372
dc00-dc1f : 0000:00:10.0
  dc00-dc1f : uhci_hcd
e000-e01f : 0000:00:10.1
  e000-e01f : uhci_hcd
e400-e41f : 0000:00:10.2
  e400-e41f : uhci_hcd
e800-e80f : 0000:00:11.1
  e800-e807 : ide0
  e808-e80f : ide1
ec00-ecff : 0000:00:12.0
  ec00-ecff : via-rhine
I have one drive from each controller in a software RAID-5: hda, hdc,
hde, and hdh.
Any suggestions for how to go about diagnosing the problem?
--
Adam Rosi-Kessel
http://adam.rosi-kessel.org