[OT] Hardware failure?
For the last several days, I've been experiencing strange lock-ups and
crashes, which I suspect may be due to hardware failure, although I'm
not sure how to diagnose this further.
I don't think that it's an OS issue, since the problem sometimes occurs
at POST, or at least before the bootloader (grub) comes up.
The failures seem to cluster; I've had repeated hangs within a few
minutes of each other, followed by days of trouble-free running.
I suspect it may be a HDD / controller problem. A little while ago, I
didn't get an actual hang (although I had seen several hangs in the
minutes before that), but some applications temporarily stopped
responding, and I saw this in syslog:
Sep 7 19:36:08 localhost kernel: [ 193.761021] ata1: drained 65536 bytes to clear DRQ.
Sep 7 19:36:08 localhost kernel: [ 193.876071] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 7 19:36:08 localhost kernel: [ 193.876077] ata1.00: failed command: READ DMA
Sep 7 19:36:08 localhost kernel: [ 193.876085] ata1.00: cmd c8/00:e8:51:00:98/00:00:00:00:00/e2 tag 0 dma 118784 in
Sep 7 19:36:08 localhost kernel: [ 193.876087] res 40/00:01:01:4f:c2/00:00:00:00:00/a0 Emask 0x4 (timeout)
Sep 7 19:36:08 localhost kernel: [ 193.876091] ata1.00: status: { DRDY }
Sep 7 19:36:08 localhost kernel: [ 193.876127] ata1: soft resetting link
Sep 7 19:36:14 localhost kernel: [ 199.076056] ata1: link is slow to respond, please be patient (ready=0)
Sep 7 19:36:18 localhost kernel: [ 203.921020] ata1: SRST failed (errno=-16)
Sep 7 19:36:18 localhost kernel: [ 203.921034] ata1: soft resetting link
Sep 7 19:36:24 localhost kernel: [ 209.121055] ata1: link is slow to respond, please be patient (ready=0)
Sep 7 19:36:28 localhost kernel: [ 213.967058] ata1: SRST failed (errno=-16)
Sep 7 19:36:28 localhost kernel: [ 213.967072] ata1: soft resetting link
Sep 7 19:36:34 localhost kernel: [ 219.168044] ata1: link is slow to respond, please be patient (ready=0)
Sep 7 19:36:59 localhost kernel: [ 244.977129] ata1.01: link status unknown, clearing UNKNOWN to NONE
Sep 7 19:37:00 localhost kernel: [ 245.385606] ata1.00: configured for UDMA/100
Sep 7 19:37:00 localhost kernel: [ 245.385623] ata1: EH complete
The last three lines seem to be from when the system began behaving
normally again. This certainly looks bad; anyone know what it means?
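In case it helps anyone check whether these episodes really do cluster, here is a rough sketch of how the stalls could be pulled out of syslog. It assumes the log format shown above (kernel uptime stamp in square brackets, then the ataN prefix). The function name ata_episodes and the choice of "drained", "exception" and "failed command" as the start of an episode are my own guesses for illustration, not anything defined by libata.

```python
import re

# Matches the kernel uptime stamp, the ata port/device and the message, e.g.
# "[ 193.876071] ata1.00: exception Emask 0x0 ..."
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(ata\d+(?:\.\d+)?): (.*)")

# Assumption: any of these substrings marks the start of an error episode.
TRIGGERS = ("drained", "exception", "failed command")


def ata_episodes(lines):
    """Yield (port, start, end) uptime stamps for each error episode.

    An episode runs from the first trigger line seen on a port until
    that port logs "EH complete". Lines that do not look like ata
    messages (such as the "res ..." continuation line) are skipped.
    """
    start = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        stamp = float(m.group(1))
        port = m.group(2).split(".")[0]  # "ata1.00" -> "ata1"
        msg = m.group(3)
        if port not in start:
            if any(t in msg for t in TRIGGERS):
                start[port] = stamp
        elif msg.startswith("EH complete"):
            yield port, start.pop(port), stamp
```

Feeding it the lines above, e.g. list(ata_episodes(open("/var/log/syslog"))), should give one episode on ata1 running from uptime 193.761021 to 245.385623, i.e. a stall of roughly 52 seconds. Note that the uptime stamps restart at every boot, so episodes from different boots would have to be told apart by the date at the start of each line.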
I'm running SMART tests, but so far I haven't seen anything that looks
funny there, although I don't really grok the SMART information.
The machine is a nearly four-year-old Acer Aspire laptop. The HDD, as
reported by SMART, is:
Model Family: Hitachi Travelstar 5K100
Device Model: HTS541060G9AT00
Serial Number: MPB3PAXMG2SR2G
Firmware Version: MB3OA60A
Celejar
--
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator