
Re: hardware problem? continual errors with drive and kernel



[This message has also been posted to linux.debian.user.]
In article <7MNYA-1sp-15@gated-at.bofh.it>, Roberto C. Sanchez wrote:
> On Fri, Feb 09, 2007 at 01:50:40PM -0800, Ric Otte wrote:
>>
>> I was wondering if anyone else had had problems like this and had any
>> suggestions.  I don't know much about hardware and don't begin to know wh=

> [Kernel oops in stable code, suggesting cpu-mem issues]

> Get a Knoppix CD or DVD and run memtest on your system for a couple of
> days.

Have you *ever* seen memtest catch a pattern-sensitive
memory failure?  Memtest is good for finding stuck bits
and stuck address lines.  But there is another class
of failure: "walking wounded" chips with electrostatic
discharge damage, bad signal integrity, failed or inadequate
bypass caps, cracked traces with high-resistance bridges.
These can give unreproducible soft memory
errors when you hit just the right address sequence
with just the right pattern in RAM.

I've had bad motherboards that could run memtest for days,
at all temperatures.  But they'd give a kernel oops
or "signal 11" before they got to the end of a kernel
compile.  GCC generates a more chaotic pattern set
than memtest does.  I had one where sliding the scroll
bar up and down on an xterm would do it.
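
To illustrate the idea of using a repeatable but chaotic
workload as a test, here is a minimal sketch in Python.
It is not a substitute for a kernel compile or memtest,
and a user-space script will not stress memory the way
GCC does.  The function names, seed, and buffer size are
all made up for the illustration.  The point is only that
a deterministic job run several times must give identical
results, so any mismatch points at hardware.

```python
import hashlib
import random


def workload_digest(seed: int, size: int = 1 << 20) -> str:
    """Fill a buffer pseudo-randomly, scatter swaps over it, hash it."""
    rng = random.Random(seed)  # same seed gives the same sequence
    buf = bytearray(rng.getrandbits(8) for _ in range(size))
    # Swap bytes at scattered positions so accesses are not sequential.
    for _ in range(size // 16):
        i = rng.randrange(size)
        j = rng.randrange(size)
        buf[i], buf[j] = buf[j], buf[i]
    return hashlib.sha256(buf).hexdigest()


def count_mismatches(runs: int = 5, seed: int = 1234,
                     size: int = 1 << 20) -> int:
    """Repeat the workload; count runs that differ from the first one."""
    reference = workload_digest(seed, size)
    return sum(
        1 for _ in range(runs - 1)
        if workload_digest(seed, size) != reference
    )


if __name__ == "__main__":
    print("mismatching runs:", count_mismatches())
```

A nonzero count would mean the same computation came out
differently on the same machine, which is the signature
of the soft errors described above.  A zero count proves
little, for the same reason a clean memtest run does.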

Desktop-class PC hardware doesn't provide fault isolation
tools.  You just have to replace parts one at a time.
First measure the supply voltages.  You can chase an
inadequate power supply for days; it makes everything
else look flaky.  If you find 12V, 5V, and 3.3V
within 5% of nominal, with the CD spinning and hard
drive seeking (fsck -nf), move on to other things.
Next look at the BIOS setup and correct any "overclocking"
or "aggressive" timing settings that some fool may
have left on the board.
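
The 5% check on the supply rails is simple arithmetic.
A minimal sketch in Python, assuming you have taken the
readings with a meter while the machine is under load;
the function names and the sample readings are invented
for the example:

```python
def within_tolerance(nominal: float, measured: float,
                     tolerance: float = 0.05) -> bool:
    """True if measured is within +/- tolerance (a fraction) of nominal."""
    return abs(measured - nominal) <= tolerance * nominal


def check_rails(readings: dict, tolerance: float = 0.05) -> dict:
    """Map each nominal rail voltage to True (in range) or False."""
    return {
        nominal: within_tolerance(nominal, measured, tolerance)
        for nominal, measured in readings.items()
    }


if __name__ == "__main__":
    # Hypothetical meter readings, keyed by nominal rail voltage.
    sample = {12.0: 11.5, 5.0: 5.1, 3.3: 3.0}
    for rail, ok in check_rails(sample).items():
        print(f"{rail:>5} V rail: {'ok' if ok else 'OUT OF RANGE'}")
```

At 5% the allowed windows are 11.4 to 12.6 V, 4.75 to
5.25 V, and 3.135 to 3.465 V, so the 3.0 V reading in
the sample would be the one to worry about.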

You've already eliminated the hard drive.
Try the cable, just to rule it out.
I haven't seen a flaky CPU in years; they seem to
fail all the way when they fail.
Memory can be bad, especially due to ESD.  But my
bet is on the motherboard.  Motherboards get
microcracks, bad solder joints, wrong-value
bypass caps, all kinds of hard-to-isolate stuff.
People don't handle them delicately enough, and they're
barely strong enough not to crack when you clamp
those cooler retainers down.  That goes even for
boards from Intel, Tyan, and Asus.


Cameron


