On Tue, Apr 16, 2002, Jamin W. Collins (jcollins@asgardsrealm.net) wrote:
> I'm hoping someone here might be able to give me a few pointers on how to
> track down a strange system lockup.
>
> When the lockup occurs, both monitors drop to power saving mode and
> the system from what I can tell is dead to the world (no network
> connectivity and completely unresponsive to local access), and both
> IDE CD drives show momentary access. The system does still have power
> as the CD drives respond appropriately to their eject buttons.
>
> The system itself is a Dell Precision Workstation 220 with the
> following hardware:
<snip>
> I have not been able to really pin the problem down to one particular
> application. I have however noticed the problem more when I am using
> Mozilla to browse. Frequently when I click on a link in Mozilla the
> system will simply drop. However, I have seen drops when Mozilla was
> not in use.
So let's rule out Mozilla as the culprit.
> Originally, I suspected the problem had something to do with system
> load and that it was simply happening more often with Mozilla due to
> the load that Mozilla placed on the system. Then I suspected that the
> problem might be memory related and only showing itself when the
> system was using large amounts of memory. I have however tested the
> system's memory using MemTest, and found no problem.
Good first step. My experience is that memory has never been the source
of an unstable system, but it's a relatively easy and conclusive test to
attempt.
> In some cases the system will go days/weeks without exhibiting the
> problem and sometimes I'll see it several times in one day.
The obvious question: what makes those days different (other than
being the ones on which your computer crashes)?
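One way to collect something to compare between good days and bad days
is a small heartbeat logger. The sketch below is only an illustration,
not a tested tool: the function names, the record format, and any log
path you pass in are assumptions of mine. It appends a timestamp and the
load averages to a file and forces each record to disk, so the last line
written before a hard lock shows roughly when the system stopped and how
busy it was.

```python
import os
import time


def heartbeat_line(timestamp, load):
    """Format one record: UTC time plus the 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    return "%s load=%.2f,%.2f,%.2f\n" % ((stamp,) + tuple(load))


def append_heartbeat(path):
    """Append one record and force it to disk so it survives a hard lock."""
    with open(path, "a") as handle:
        handle.write(heartbeat_line(time.time(), os.getloadavg()))
        handle.flush()
        os.fsync(handle.fileno())


def run_forever(path, interval_seconds=60):
    """Write a heartbeat every `interval_seconds` until the machine dies."""
    while True:
        append_heartbeat(path)
        time.sleep(interval_seconds)
```

Calling run_forever() with a path on a local disk and leaving it running
in the background is the intended use. After a lockup, the final record
bounds the time of failure to within one interval.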
> Any ideas on how to pin a problem like this down? Possibly how to
> track what the last few actions a system made before dropping out?
You've got a tough situation:
- Random lockups.
- Infrequent lockups.
- No system state (logs) indicating reasons for lockups.
I tend to divide these problems into two general areas:
- Hardware failure.
- Software failure.
The usual suspects, hardware:
- Memory -- you've tested this
- CPU -- a continuous kernel-build loop is a pretty good test. You're
looking for signal 11 (SIGSEGV) errors from the compiler.
- Disk. Usually bad blocks or similar, though these almost *always*
leave a logfile trace.
- Other. E.g.: I had a flaky power supply on a laptop with
SpeedStep enabled on the CPU. The power cycling caused CPU
clock-speed cycling, which in turn caused frequent hard system locks.
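The kernel-build loop from the CPU item could be automated roughly as
follows. This is a minimal sketch under stated assumptions: the function
names are made up, the build command is whatever you pass in, and the
pattern is only my guess at how a signal 11 crash tends to appear in
build output. Adjust it to what your toolchain actually prints.

```python
import re
import subprocess

# Assumed pattern for a compiler crash caused by signal 11; not exhaustive.
SIG11_PATTERN = re.compile(r"signal 11|segmentation fault", re.IGNORECASE)


def find_sig11_lines(output):
    """Return the lines of build output that mention a signal 11 crash."""
    return [line for line in output.splitlines() if SIG11_PATTERN.search(line)]


def build_loop(command, iterations):
    """Run `command` repeatedly; return (pass number, line) for every hit."""
    hits = []
    for i in range(1, iterations + 1):
        result = subprocess.run(
            command,
            shell=True,                 # lets the command use && and friends
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # compiler errors arrive on stderr
            universal_newlines=True,
        )
        hits.extend((i, line) for line in find_sig11_lines(result.stdout))
    return hits
```

From the top of a kernel source tree, something like
build_loop("make clean && make", 50) would run fifty passes. Hits that
land on different files from one pass to the next point toward hardware
rather than a genuine compiler bug. A hard lock during the loop will of
course leave no result at all, which is itself a data point.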
The usual suspects, software:
- Drivers.
- Kernel.
...and it's almost always drivers (kernel modules).
Diagnosing is difficult with infrequent, random lockups. Is there a
period during which you'll almost certainly see a problem?
Try removing everything. Unload all your drivers. Shut down X11. Run
the system. See if it crashes, say, within a day.
If not, load half your drivers. Stress the system (disk, CPU, memory).
See if you get a crash, say, within a day. If not, unload these modules,
and load the other half.
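The halving procedure is a plain bisection over the driver list. Here is
a rough sketch of the bookkeeping, assuming a single misbehaving module
(if two modules only fail in combination, this will mislead you). The
function names are hypothetical; the only real interface used is
/proc/modules, whose lines begin with the module name.

```python
def loaded_modules(path="/proc/modules"):
    """List currently loaded module names (first field of each line)."""
    with open(path) as handle:
        return [line.split()[0] for line in handle if line.strip()]


def split_halves(modules):
    """Split a list of module names into two halves for a test round."""
    middle = (len(modules) + 1) // 2
    return modules[:middle], modules[middle:]


def narrow(suspects, first_half_crashed):
    """Keep whichever half the last test round implicated."""
    first, second = split_halves(suspects)
    return first if first_half_crashed else second
```

Record loaded_modules() once while everything is loaded, then after each
test period call narrow() with the outcome. With N modules you need about
log2(N) rounds, which matters when each round costs a day of waiting.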
If this doesn't work, try swapping components -- SCSI cards, modems,
NICs, etc. If you can't swap them, then just pull them.
This process can take a while, particularly in the circumstances you
describe.
Peace.
--
Karsten M. Self <kmself@ix.netcom.com> http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
Moderator, Free Software Law Discussion mailing list:
http://lists.alt.org/mailman/listinfo/fsl-discuss/