
Western Digital drive and ASUS P4PE motherboard: do they get along?



Two months ago, I set up a new machine with an ASUS P4PE motherboard,
and a single 120GB disk (WD1200JB-00DUA0, ATA DISK drive).  I'm using
Debian/woody with kernel 2.4.20 compiled using kernel-package.

ObDebian: could there be a bad interaction between the Debian-patched
2.4.20 kernel (I took it from package kernel-source-2.4.20) and the
other software in woody?  Besides the kernel, the only packages I
built (from debian/unstable sources) were initrd-tools (0.1.42)
and kernel-package (8.025).  Later, I updated nfs-utils to version
1.0.3 in the hope that it would cure the "erroneous SM_UNMON request"
messages (it didn't help) -- see below.


The machine ran (with no load) for two months.  The day after I put it
into service, it crashed :(

It has crashed three times in two weeks, each time with disk I/O
errors on the console:

Apr 20 14:14:27 jeff kernel: hda: dma_intr: status=0x11 { SeekComplete Error }
Apr 20 14:14:27 jeff kernel: hda: dma_intr: error=0x04 { DriveStatusError }
Apr 20 14:14:27 jeff kernel: hda: status error: status=0x11 { SeekComplete Error }
Apr 20 14:14:27 jeff kernel: hda: status error: error=0x04 { DriveStatusError }
Apr 20 14:14:27 jeff kernel: hda: drive not ready for command
Apr 20 14:14:27 jeff kernel: hda: status error: status=0x11 { SeekComplete Error }
Apr 20 14:14:27 jeff kernel: hda: status error: error=0x04 { DriveStatusError }
Apr 20 14:14:27 jeff kernel: hda: drive not ready for command
Apr 20 14:14:27 jeff kernel: hda: status error: status=0x11 { SeekComplete Error }
Apr 20 14:14:27 jeff kernel: hda: status error: error=0x04 { DriveStatusError }
Apr 20 14:14:27 jeff kernel: hda: DMA disabled
Apr 20 14:14:27 jeff kernel: hda: drive not ready for command

At this point, the machine is locked solid, and I need to power-cycle.
Simply pressing the reset button doesn't work -- "no system disk" is
the boot message.
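
For reference, the status and error bytes in the log above can be
decoded bit by bit.  The sketch below is only an illustration: the bit
names are my reading of what the 2.4 IDE driver prints, so treat the
two tables as assumptions rather than a copy of the driver source, and
decode() is a hypothetical helper name.

```python
# Bit names as the 2.4 IDE driver appears to print them (assumption,
# not copied from the kernel source). Listed from high bit to low bit.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

ERROR_BITS = {
    0x80: "BadSector/BadCRC",
    0x40: "UncorrectableError",
    0x10: "SectorIdNotFound",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    print("status=0x11:", decode(0x11, STATUS_BITS))
    print("error=0x04: ", decode(0x04, ERROR_BITS))
```

Read this way, status=0x11 has the SeekComplete and Error bits set but
not the DriveReady bit (0x40), which would be consistent with the
"drive not ready for command" lines.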

The odd thing is that the system doesn't show any disk-related errors
in the logs -- until the crash.  If it were truly a disk failure, I
would expect some "recoverable" errors while the system is running.
Moreover, "smartctl -a" claims the disk is healthy.

I do see odd networking-related messages, on the other hand.  I don't
know whether they are related or a red herring, but briefly, I see
the following four "unusual" types of messages:

1. rpc.statd[263]: Received erroneous SM_UNMON request from X for Y
2. rpc.statd[263]: notify_host: failed to notify 127.0.0.1
3. kernel: sending pkt_too_big (len[1500] pmtu[1452]) to self
4. kernel: TCP: Treason uncloaked! Peer 138.202.33.24:50086/80 shrinks window 1617230634:1617237286. Repaired.

I had even more problems initially, using the on-board Broadcom
Ethernet adaptor, so I switched to a reliable Intel Ethernet Pro 100.
I still get all four types of messages.
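
One way to check whether these messages cluster around the crashes
would be to count each type in the system log.  This is a minimal
sketch, assuming the log lines look like the ones quoted in this
message; the category names and the functions classify() and
count_messages() are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# One pattern per message type quoted above, plus the hda errors.
# The patterns are guesses based only on the lines in this message.
PATTERNS = {
    "sm_unmon": re.compile(r"rpc\.statd\[\d+\]: Received erroneous SM_UNMON request"),
    "notify_failed": re.compile(r"rpc\.statd\[\d+\]: notify_host: failed to notify"),
    "pkt_too_big": re.compile(r"kernel: sending pkt_too_big"),
    "treason": re.compile(r"kernel: TCP: Treason uncloaked!"),
    "hda_error": re.compile(r"kernel: hda: .*Error"),
}


def classify(line):
    """Return the category name for a log line, or None if it matches nothing."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def count_messages(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "rpc.statd[263]: notify_host: failed to notify 127.0.0.1",
        "kernel: sending pkt_too_big (len[1500] pmtu[1452]) to self",
        "Apr 20 14:14:27 jeff kernel: hda: dma_intr: status=0x11 { SeekComplete Error }",
    ]
    print(count_messages(sample))
```

To run it over a real log, pass an open file object to
count_messages(); the log file's location depends on the syslog
configuration, so none is assumed here.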

Thanks,
-Steve



