
Re: systems hangs every few days



On 6/18/2013 7:59 AM, Chris Purves wrote:
> After upgrading to wheezy, I get a system hang every one or two days where the system becomes completely unresponsive and I need to do a cold boot.
> 
> This is an older machine with an Athlon processor.  I'm not running X.  I don't see anything unusual in the logs.  The last entry in syslog is typically a cron job, but not always the same one.  The system seems to freeze without any warning.
> 
> I tried downgrading the kernel back to the squeeze version (2.6) and it still locks up.  Before upgrading to wheezy I resized a few of the partitions.  Other than that, nothing else has changed and everything had been running fine for years.
> 
> I'd appreciate any help in debugging this problem.

It's probably not a kernel/software problem: you're not seeing kernel
panics, there's nothing in the logs, and the lockups persisted after you
downgraded the kernel.  It could be DRAM, but that's unlikely.  Given that
marginal silicon typically fails within hours/days of initial use and
rarely thereafter, it's probably not a DIMM gone bad as someone else
suggested.

You said this system is "older" and housing an Athlon CPU.  There were 5
generations of Athlon produced from 1999 to 2005.  Thus this box could
be anywhere from 7 to 14 years old.  On machines of this age you need to
check/test/troubleshoot/replace hardware in the following order:

1.  CPU fan -- these rarely last 7 years, let alone 14.  Some models may
    lose 80% of their nominal RPM with age without emitting any noticeable
    noise.  The heatsink may get just enough airflow to allow a few
    days of run time.  When the fan fails completely, the box locks up
    in a few minutes.

2.  PSU fan -- a failing one will cause MOSFET/cap/etc. overheating, which
    can cause "random" lockups, reboots, and other odd behavior.

3.  PSU itself -- a failed fan can permanently damage MOSFETs/caps/etc.
    Even with a good fan, PSU components can fail with age.

4.  Removable media drives -- floppy/CD/DVD-ROM drives can fail in odd
    ways, sending spurious high voltage signals or shorting wires, which
    can lock up the motherboard or cause random reboots.  Disconnect
    their data cables and power leads and run without them.

5.  The motherboard.  Even with good cooling over the life of a machine
    the motherboard can still simply fail.  You may not find bulged caps
    or burn marks on the VRMs; there may be no visible signs of failure.

    Case in point:  I had a Biostar Socket A nForce2 400 motherboard
    w/Athlon XP 2500 simply give up the ghost in 2011 in a similar
    manner.  It locked up a few times over a period of a week or so,
    then simply wouldn't POST.

    I built that machine in Aug 2003 and it lasted 8 years.  I started
    with two 92x25mm Panaflo case fans, plus the 80x25mm PSU fan.
    I replaced the PSU fan with an NMB Boxer, and the case fans with
    two Nidec Beta Vs, twice during the life of the box.  All of the
    fans were fully functional at the time of replacement.  This was
    proactive maintenance.  The box had 110 CFM of properly directed
    airflow during its lifespan.  Compare this to the ~30 CFM of a
    quiet Dell, HP, or IBM machine.  Anyone who knows hardware knows
    that these are all top shelf 12VDC fans.  The PSU is still running,
    in another box, as are the two DIMMs and the CPU.  The motherboard
    simply gave up the ghost after 8 years of 24x7 operation.  Let's
    hope that isn't the case here.
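
Since nothing shows up in syslog before the freeze, it may help to log
fan speed and temperature yourself, so the last line written before a
hang tells you whether the CPU was cooking or a fan had slowed down.
Below is a rough sketch of that idea, not a tested tool.  It assumes the
right sensor driver for your board is loaded (e.g. after running
sensors-detect from lm-sensors) and that the kernel exposes readings
under /sys/class/hwmon.  The function names and the log file path are
made up for illustration; adjust them to taste.

```python
"""Periodically log hwmon fan/temperature readings and sync them to disk."""
import glob
import os
import time

# Older kernels put the attributes under hwmonN/device/, newer ones
# directly under hwmonN/, so look in both places.
PATTERNS = (
    os.path.join("hwmon*", "fan*_input"),
    os.path.join("hwmon*", "temp*_input"),
    os.path.join("hwmon*", "device", "fan*_input"),
    os.path.join("hwmon*", "device", "temp*_input"),
)


def read_sensors(root="/sys/class/hwmon"):
    """Return {relative_path: raw_integer} for every readable sensor."""
    readings = {}
    for pattern in PATTERNS:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            try:
                with open(path) as f:
                    value = int(f.read().strip())
            except (OSError, IOError, ValueError):
                continue  # sensor unreadable or not reporting; skip it
            readings[os.path.relpath(path, root)] = value
    return readings


def format_line(readings, now=None):
    """Build one log line.  hwmon reports fans in RPM and temperatures
    in millidegrees Celsius."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    parts = []
    for name, value in sorted(readings.items()):
        if os.path.basename(name).startswith("temp"):
            parts.append("%s=%.1fC" % (name, value / 1000.0))
        else:
            parts.append("%s=%dRPM" % (name, value))
    return stamp + " " + " ".join(parts)


def log_forever(logfile="/var/log/hwmon-watch.log", interval=60,
                root="/sys/class/hwmon", samples=None):
    """Append a reading every `interval` seconds.  With samples=None
    this loops until the process is killed (or the box locks up)."""
    count = 0
    while samples is None or count < samples:
        with open(logfile, "a") as f:
            f.write(format_line(read_sensors(root)) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before any hang
        count += 1
        time.sleep(interval)
```

Call log_forever() as root and leave it running.  After the next
lockup, look at the tail of the log: a fan reading well below its
nominal RPM, or a steadily climbing temperature, points at items 1-3
above.  If the readings look normal right up to the hang, move on down
the list.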


-- 
Stan

