on Tue, Apr 16, 2002, Jamin W. Collins (jcollins@asgardsrealm.net) wrote:
> I'm hoping someone here might be able to give me a few pointers on how to
> track down a strange system lockup.
>
> When the lockup occurs, both monitors drop to power saving mode and
> the system from what I can tell is dead to the world (no network
> connectivity and completely unresponsive to local access), and both
> IDE CD drives show momentary access. The system does still have power
> as the CD drives respond appropriately to their eject buttons.
>
> The system itself is a Dell Precision Workstation 220 with the
> following hardware:

<snip>

> I have not been able to really pin the problem down to one particular
> application. I have however noticed the problem more when I am using
> Mozilla to browse. Frequently when I click on a link in Mozilla the
> system will simply drop. However, I have seen drops when Mozilla was
> not in use.

So let's toss Mozilla as the culprit.

> Originally, I suspected the problem had something to do with system
> load and that it was simply happening more often with Mozilla due to
> the load that Mozilla placed on the system. Then I suspected that the
> problem might be memory related and only showing itself when the
> system was using large amounts of memory. I have however tested the
> system's memory using MemTest, and found no problem.

Good first step. In my experience, memory has never been the source of
an unstable system, but it's a relatively easy and conclusive test to
attempt.

> In some cases the system will go days/weeks without exhibiting the
> problem and sometimes I'll see it several times in one day.

The obvious question: what makes those days different (other than
being the ones on which your computer crashes)?

> Any ideas on how to pin a problem like this down? Possibly how to
> track what the last few actions a system made before dropping out?

You've got a tough situation:

  - Random lockups.
  - Infrequent lockups.
  - No system state (logs) indicating reasons for lockups.

I tend to divide these problems into two general areas:

  - Hardware failure.
  - Software failure.

The usual suspects, hardware:

  - Memory -- you've tested this.
  - CPU -- a continuous kernel-build loop is a pretty good test.
    You're looking for SIG-11 (SIGSEGV) errors.
  - Disk. Usually bad blocks or similar, though these almost *always*
    leave a logfile trace.
  - Other. E.g.: I had a flaky power supply on a laptop with SpeedStep
    enabled on the CPU. The power cycling resulted in CPU clock-speed
    cycling, which in turn resulted in frequent hard system locks.

The usual suspects, software:

  - Drivers.
  - Kernel.

...and it's almost always drivers (kernel modules).

Diagnosing is difficult with infrequent, random lockups. Is there a
period during which you'll almost certainly see a problem?

Try removing everything. Unload all your drivers. Shut down X11. Run
the system. See if it crashes, say, within a day.

If not, load half your drivers. Stress the system (disk, CPU, memory).
See if you get a crash, say, within a day.

If not, unload those modules and load the other half.

If this doesn't work, try swapping components -- SCSI cards, modems,
NICs, etc. If you can't swap them, then just pull them.

This process can take a while, particularly in the circumstances you
describe.

Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>   http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
 Moderator, Free Software Law Discussion mailing list:
     http://lists.alt.org/mailman/listinfo/fsl-discuss/
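
The "load half your drivers, then the other half" procedure in the reply above is essentially a binary search over the set of kernel modules. A minimal sketch of that search in Python follows. Everything named here is hypothetical: the module names are placeholders, and `crashes` stands in for the manual step of loading a set of modules, stressing the system for a day, and noting whether it locked up. The sketch also assumes that exactly one module is at fault and that the lockup reproduces reliably within the test window, which may not hold for an infrequent, random failure.

```python
from typing import Callable, List, Sequence


def find_bad_module(
    modules: Sequence[str],
    crashes: Callable[[List[str]], bool],
) -> str:
    """Narrow a list of modules down to the one that triggers the lockup.

    `crashes(subset)` is a hypothetical callback: it should return True if
    the system locked up while only `subset` was loaded. In practice this
    is a manual, day-long test, not an automated call.
    """
    if not modules:
        raise ValueError("need at least one module to test")

    candidates = list(modules)
    while len(candidates) > 1:
        # Load the first half of the remaining candidates and test.
        half = candidates[: len(candidates) // 2]
        if crashes(half):
            # The culprit is somewhere in the half that was loaded.
            candidates = half
        else:
            # No crash: the culprit must be in the other half.
            candidates = candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Placeholder module names, for illustration only.
    loaded = ["mod_a", "mod_b", "mod_c", "mod_d", "mod_e"]
    # Simulated test: pretend mod_d is the faulty driver.
    print(find_bad_module(loaded, lambda subset: "mod_d" in subset))
```

Halving the candidates each round keeps the number of day-long test runs at roughly log2 of the module count, which matters when each run takes a day or more.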