
Re: Intermittent System Lockup (long)



on Tue, Apr 16, 2002, Jamin W. Collins (jcollins@asgardsrealm.net) wrote:
> I'm hoping someone here might be able to give me a few pointers on how to
> track down a strange system lockup.
> 
> When the lockup occurs, both monitors drop to power saving mode and
> the system from what I can tell is dead to the world (no network
> connectivity and completely unresponsive to local access), and both
> IDE CD drives show momentary access.  The system does still have power
> as the CD drives respond appropriately to their eject buttons.
> 
> The system itself is a Dell Precision Workstation 220 with the
> following hardware:

<snip>

> I have not been able to really pin the problem down to one particular
> application.  I have however noticed the problem more when I am using
> Mozilla to browse.  Frequently when I click on a link in Mozilla the
> system will simply drop.  However, I have seen drops when Mozilla was
> not in use.

So let's toss Mozilla as the culprit.

> Originally, I suspected the problem had something to do with system
> load and that it was simply happening more often with Mozilla due to
> the load that Mozilla placed on the system.  Then I suspected that the
> problem might be memory related and only showing itself when the
> system was using large amounts of memory.  I have however tested the
> system's memory using MemTest, and found no problem.

Good first step.  My experience is that memory has never been the source
of an unstable system, but it's a relatively easy and conclusive test to
attempt.

> In some cases the system will go days/weeks without exhibiting the
> problem and sometimes I'll see it several times in one day.

The obvious question:  what makes those days different (other than
being the ones on which your computer crashes)?

> Any ideas on how to pin a problem like this down?  Possibly how to
> track what the last few actions a system made before dropping out?

You've got a tough situation:

  - Random lockups.
  - Infrequent lockups.
  - No system state (logs) indicating reasons for lockups.
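One way to get at least some state out of a machine that leaves no logs is a heartbeat file that is forced to disk on every write.  A minimal sketch, assuming Python is available on the box; the function names, the file path and the interval are all made up for illustration, and the load average is just one example of something worth recording:

```python
import os
import time


def write_heartbeat(path, note=""):
    """Append one timestamped line and ask the kernel to flush it to disk."""
    load1, load5, load15 = os.getloadavg()
    line = "%s load=%.2f,%.2f,%.2f %s\n" % (
        time.strftime("%Y-%m-%d %H:%M:%S"), load1, load5, load15, note)
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # do not leave the line sitting in a buffer
    return line


def run_heartbeat(path, interval=5.0, count=None):
    """Write a heartbeat every `interval` seconds; count=None runs forever."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        written += 1
        if count is None or written < count:
            time.sleep(interval)
    return written
```

After a lockup and reboot, the last line in the file gives the time of death to within one interval, which can then be compared against cron jobs, other logs, or whatever you were doing at the time.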

I tend to divide these problems into two general areas:

  - Hardware failure.
  - Software failure.


The usual suspects, hardware:

  - Memory -- you've tested this

  - CPU -- a continuous kernel-build loop is a pretty good test.  You're
    looking for signal 11 (SIGSEGV) errors from the compiler.

  - Disk.  Usually bad blocks or similar, though these almost *always*
    leave a logfile trace.

  - Other.  E.g.:  I had a flaky power supply on a laptop with
    SpeedStep enabled on the CPU.  The power cycling caused CPU
    clock-speed cycling, which in turn caused frequent hard system locks.
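The kernel-build loop mentioned for the CPU test could be driven by something like the following.  This is only a sketch: the helper names are hypothetical, the build command is whatever you normally use, and the strings it searches for are an assumed, non-exhaustive guess at how a toolchain reports a compiler dying on signal 11.

```python
import subprocess

# Text fragments that commonly accompany a compiler dying on signal 11.
# Assumed, not exhaustive; adjust for your toolchain's wording.
SIGNAL_HINTS = ("signal 11", "segmentation fault")


def suspicious_lines(output):
    """Return the output lines that mention a signal-11 style failure."""
    return [line for line in output.splitlines()
            if any(hint in line.lower() for hint in SIGNAL_HINTS)]


def build_loop(command, max_runs=None):
    """Run `command` until it fails; return (run number, return code, hits)."""
    run = 0
    while max_runs is None or run < max_runs:
        run += 1
        proc = subprocess.run(command, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
        hits = suspicious_lines(proc.stdout)
        if proc.returncode != 0 or hits:
            return run, proc.returncode, hits
    return run, 0, []
```

Run from the top of a kernel source tree, a call such as `build_loop(["sh", "-c", "make clean && make"])` keeps rebuilding until something breaks.  As a rule of thumb, a failure that lands on a different file each time suggests hardware, while one that always stops at the same place suggests a genuine build problem.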


The usual suspects, software:

  - Drivers.

  - Kernel.

...and it's almost always drivers (kernel modules).

Diagnosing is difficult with infrequent, random lockups.  Is there a
period during which you'll almost certainly see a problem?

Try removing everything.  Unload all your drivers.  Shut down X11.  Run
the system.  See if it crashes, say, within a day.

If not, load half your drivers.  Stress the system (disk, CPU, memory).
See if you get a crash, say, within a day.  If not, unload these drivers,
and load the other half.
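The halving procedure is a plain binary search over the driver list.  A rough sketch of the logic, under two assumptions that may not hold in practice: there is a single culprit, and it reliably triggers a lockup within the test window.  `crashes_with` is a hypothetical callback standing in for the manual step of loading a set of drivers, stressing the machine, and waiting a day.  The sketch starts from the full set rather than from the bare system, so the nothing-loaded baseline run is still a separate manual step.

```python
def find_bad_driver(drivers, crashes_with):
    """Narrow a list of drivers to one suspect by repeated halving.

    `crashes_with(subset)` is a hypothetical callback: load only `subset`,
    stress the machine for the test window, and return True if it locked up.
    Assumes a single culprit that reliably triggers within that window.
    """
    suspects = list(drivers)
    if not suspects or not crashes_with(suspects):
        return None  # nothing reproduces; look at hardware instead
    while len(suspects) > 1:
        half = len(suspects) // 2
        first, second = suspects[:half], suspects[half:]
        # If the first half is clean, the culprit is taken to be in the second.
        suspects = first if crashes_with(first) else second
    return suspects[0]
```

With N drivers this takes roughly log2(N) test windows after the initial check, so even a day per test stays manageable for a typical module list.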

If this doesn't work, try swapping components -- SCSI cards, modems,
NICs, etc.  If you can't swap them, then just pull them.

This process can take a while, particularly in the circumstances you
describe.

Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   Moderator, Free Software Law Discussion mailing list:
     http://lists.alt.org/mailman/listinfo/fsl-discuss/
