
Re: Help with system recovery



Brad Cramer said:
>
> be going on or how to fix this problem. What should I be looking for in
> log files? Could it be bad RAM? Any help would be greatly appreciated.


Problems like this are the hardest to track down. There are several
things you can try to narrow it down.

BEFORE TESTING
===============
Get a null modem cable and configure a serial console on your machine.
If you're not sure how, search for "linux serial console" on most any
search engine and a bunch of hits should come up. Connect your system
to another one running a terminal emulation package (e.g. minicom) and
log the output to a file. You need to keep the emulation software up
the whole time or messages may get lost.
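
As a rough sketch of the kernel side (the port ttyS0 and the 9600 speed
are assumptions, use whatever matches your hardware and your terminal
emulator settings), the boot parameters would look something like this,
e.g. in the append= line of lilo.conf:

```
append="console=tty0 console=ttyS0,9600n8"
```

With both consoles listed, kernel messages go to the normal screen and
to the serial port, so the other machine can capture whatever is printed
right before a lockup.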


Test 1
========
Exit out of X, then download, compile, and run 2-3 copies of CPUBurn,
available here:
http://users.ev1.net/~redelm/

For the first few hours keep a close eye on the system. As the website
warns, it can cause serious damage to the system if it is not properly
cooled; there's even been a reported case of a power supply burning out.
If your system is properly cooled you should be able to run a lot of
CPUBurn processes and the system won't crash or reboot. If it does crash
or reboot, stop here. I recommend running this for at least 24 hours. Do
not use the computer while it is running or it may skew the results.

Test 2
==========
Included in the cpuburn package is a memory tester. I recommend running
it at a different time from Test 1. You can run both at the same time,
but that may make it difficult to determine what caused a crash (RAM
or CPU). I recommend running burnBX or burnMMX with the 'P' option
(uses 64MB of RAM), and running multiple copies of it (either load up
screen, or start them in the background with &). If you have 512MB of
RAM I would load 7 or 8 copies. I recommend running this test for
about 24 hours as well. As before, I recommend not using the computer
while this is going on.
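
Starting the copies in the background could be sketched like this. The
launch_copies helper is just a name made up for illustration, and the
example assumes burnBX sits in the current directory:

```shell
# Start N background copies of a command and print the PID of each.
# launch_copies is a hypothetical helper, not part of cpuburn.
launch_copies() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        i=$((i + 1))
        echo "started copy $i as PID $!"
    done
}

# Example (not run here): 7 copies at 64MB each is about 448MB,
# which leaves a little of a 512MB machine for the kernel.
#   launch_copies 7 ./burnBX P
#   wait    # blocks until every copy has exited
```

Keep the list of PIDs around so you can kill the copies when the 24
hours are up.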


Test 3
==========
Get memtest86 from http://www.memtest86.com/, compile it, make the
boot disk, and boot from it. Turn on the advanced tests (see the
documentation). This test will probably take 72 hours or more.
Your computer will not be usable while this test is running.

Test 4
===========
Get bonnie++ and run it in a loop. I usually loop it for 72 hours
to test the disk and controller. Redirect the output to a log file so
you can monitor it. Again, I recommend not using your computer during
this time.
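
One way to do the looping and logging is a small wrapper like the one
below. It is only a sketch: loop_until is a made-up helper name, and the
/mnt/scratch directory and the bonnie.log file in the example are
placeholders for your own test directory and log file.

```shell
# Rerun a command until a time limit passes, appending all output to a log.
# loop_until is a hypothetical helper; uses `date +%s` for the clock.
loop_until() {
    limit=$1    # how long to keep looping, in seconds
    log=$2      # log file to append to
    shift 2
    end=$(( $(date +%s) + limit ))
    pass=0
    while [ "$(date +%s)" -lt "$end" ]; do
        pass=$((pass + 1))
        echo "=== pass $pass started $(date) ===" >> "$log"
        "$@" >> "$log" 2>&1 || echo "*** pass $pass exited with status $?" >> "$log"
    done
}

# Example (not run here): loop bonnie++ for 72 hours (72 * 3600 seconds).
# -d picks the directory to test in, -u the user to run as when started as root.
#   loop_until $((72 * 3600)) bonnie.log bonnie++ -d /mnt/scratch -u nobody
```

From the machine on the serial console you can then watch the log with
tail -f, and a pass that never finishes tells you roughly when things
went wrong.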


Test 5
=============
Since you're using nvidia, I recommend making sure AGP is
disabled by checking /proc/driver/nvidia/agp. I also recommend
disabling AGP in X, using the option:

Option "NvAGP"   "0"

in the Device section of your X config, same place where you define
the driver.
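
For reference, the whole Device section would end up looking roughly
like this. The Identifier string is just an example, keep whatever
yours already says:

```
Section "Device"
    Identifier "NVIDIA card"
    Driver     "nvidia"
    Option     "NvAGP" "0"
EndSection
```

Restart X after the change so the driver picks it up.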

Then try using the system (with the serial console on the other
computer) and see if it still locks up.


Test 6
===============
My next suggestion is to try another kernel, preferably a 2.2.x kernel,
which may be difficult if you're using ext3, though you can probably
mount the filesystems as ext2 while using 2.2.x. I use 2.2.19 on all
my systems and don't have lockups. Not too long ago my nvidia system
rebooted under intensive load, but that was tracked down to a failed
fan on the cheap video card, which brings me to ...


Test 7
================
Perhaps the easiest and least intrusive test. Open the side of
the case, point a fan (a floor fan) at the internals, turn the
fan on medium or high so a ton of air gets blown into the case, and
try to use the system to see if it locks up.

As you can probably see, tracking down a system crash isn't easy
or fast. Back when I had my Abit BP6 I spent literally 6 months
trying different things to solve the crashes, only to find out later
that the board revision I had came with a defect in the voltage
regulators. In the process I spent WAY more trying to fix the problem
than I would have originally if I had just gone out and bought a dual
P2 instead of trying to go cheap shit with Celerons. Last year I bought
another board, the Asus A7A266, which had even worse problems: something
with the PCI bus or controller caused immediate and complete filesystem
corruption on any disk connected to the system.

Also be sure you have a good quality power supply that provides enough
power for the system. My AMD Athlon 1300 runs off a PC Power & Cooling
TurboCool 425ATX. It also helps a lot if the system is connected to
a battery backup system. Bad power can easily cause lockups and reboots
without warning (such power problems may not be visible otherwise). If
it is a power issue, there may be permanent damage to the system already.

nate




