
Re: [OT] Hardware failure?



 On 9/7/2010 5:20 PM, Celejar wrote:
For the last several days, I've been experiencing strange lock-ups and
crashes, which I suspect may be due to hardware failure, although I'm
not sure how to diagnose this further.

I don't think that it's an OS issue, since the problem sometimes occurs
at POST, or at least before the bootloader (grub) comes up.

The failures seem to cluster; I've had repeated hangs within a few
minutes, and then good running for days.

I suspect it may be a HDD / controller problem; a little while ago, I
didn't get an actual hang (although I had seen several hangs minutes
before that), but some applications temporarily stopped responding, and
I saw this in syslog:

Sep  7 19:36:08 localhost kernel: [  193.761021] ata1: drained 65536 bytes to clear DRQ.
Sep  7 19:36:08 localhost kernel: [  193.876071] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  7 19:36:08 localhost kernel: [  193.876077] ata1.00: failed command: READ DMA
Sep  7 19:36:08 localhost kernel: [  193.876085] ata1.00: cmd c8/00:e8:51:00:98/00:00:00:00:00/e2 tag 0 dma 118784 in
Sep  7 19:36:08 localhost kernel: [  193.876087]          res 40/00:01:01:4f:c2/00:00:00:00:00/a0 Emask 0x4 (timeout)
Sep  7 19:36:08 localhost kernel: [  193.876091] ata1.00: status: { DRDY }
Sep  7 19:36:08 localhost kernel: [  193.876127] ata1: soft resetting link
Sep  7 19:36:14 localhost kernel: [  199.076056] ata1: link is slow to respond, please be patient (ready=0)
Sep  7 19:36:18 localhost kernel: [  203.921020] ata1: SRST failed (errno=-16)
Sep  7 19:36:18 localhost kernel: [  203.921034] ata1: soft resetting link
Sep  7 19:36:24 localhost kernel: [  209.121055] ata1: link is slow to respond, please be patient (ready=0)
Sep  7 19:36:28 localhost kernel: [  213.967058] ata1: SRST failed (errno=-16)
Sep  7 19:36:28 localhost kernel: [  213.967072] ata1: soft resetting link
Sep  7 19:36:34 localhost kernel: [  219.168044] ata1: link is slow to respond, please be patient (ready=0)

Sep  7 19:36:59 localhost kernel: [  244.977129] ata1.01: link status unknown, clearing UNKNOWN to NONE
Sep  7 19:37:00 localhost kernel: [  245.385606] ata1.00: configured for UDMA/100
Sep  7 19:37:00 localhost kernel: [  245.385623] ata1: EH complete

The last three lines seem to be from when the system began behaving
normally again.  This certainly looks bad; anyone know what it means?

I'm running SMART tests, but so far I haven't seen anything that looks
funny there, although I don't really grok the SMART information.

The machine is a nearly four year old Acer Aspire laptop.  The HDD, as
reported by SMART, is:

Model Family:     Hitachi Travelstar 5K100
Device Model:     HTS541060G9AT00
Serial Number:    MPB3PAXMG2SR2G
Firmware Version: MB3OA60A

Celejar

It looks to me like the drive is timing out and hanging on reads, rather than hitting actual bad sectors: the failed READ DMA is reported as a timeout, and the soft resets that follow fail as well. This could be a bad SATA (or IDE) cable, or a controller problem.
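
If you want to check whether the rest of your syslog tells the same story, here is a rough Python sketch that tallies the reasons libata prints on its "res ... Emask" lines and counts failed soft resets per port. The function name and the regular expressions are my own, written against the lines you pasted, so treat them as assumptions and adjust them if your kernel formats things differently. As far as I know, a genuine bad sector would normally show up as "media error" rather than "timeout", but I would not rely on that alone.

```python
import re
import sys
from collections import Counter

# Result line printed by libata, e.g.
#   res 40/00:01:01:4f:c2/00:00:00:00:00/a0 Emask 0x4 (timeout)
# Pattern is an assumption based on the log excerpt in this thread.
RESULT_RE = re.compile(r"\bres [0-9a-f/:]+ Emask (0x[0-9a-f]+) \(([^)]+)\)")

# Failed soft reset, e.g. "ata1: SRST failed (errno=-16)"
RESET_FAIL_RE = re.compile(r"(ata\d+): SRST failed")


def summarize(lines):
    """Return (error reasons, failed soft resets per port) as Counters."""
    reasons = Counter()
    failed_resets = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            reasons[match.group(2)] += 1
        match = RESET_FAIL_RE.search(line)
        if match:
            failed_resets[match.group(1)] += 1
    return reasons, failed_resets


if __name__ == "__main__":
    # Usage: python ata_summary.py <path to your syslog file>
    with open(sys.argv[1], errors="replace") as log:
        reasons, failed_resets = summarize(log)
    for reason, count in reasons.most_common():
        print(f"{count:5d}  {reason}")
    for port, count in failed_resets.most_common():
        print(f"{count:5d}  failed soft resets on {port}")
```

If nearly everything it reports is timeouts and failed resets, that would fit a flaky connection or controller better than a dying platter.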

Hmm, I just noticed it's a laptop, so there's no cable to replace. I had an old laptop where I had to wedge something underneath the hard drive to get it to make good contact with the connector. You could try a similar experiment.

