
Re: Hardware failure -- how to find out?



Cassiano Leal wrote:
People,

I have at work a mixed system Debian sarge/etch running a firewall and vpn server on a K6-2.

Today we were experiencing some connectivity problems, and we found out that they were caused by iptables not initializing properly and segfaulting. So I went into the server room and found that the computer in question was beeping constantly, which led me to believe it was a hardware failure.

If it was the speaker then it could mean the motherboard was resetting while still in the BIOS.


The computer wouldn't respond to 'init 6' via ssh, so I tried Ctrl-Alt-Delete locally, without success. I then tried to log in at the console, which also failed.

You probably got a kernel oops or your root hard drive went offline. A kernel oops will usually make it to the console but not to the logs. It's usually caused by CPU overheating or memory problems, and less commonly by a failing power supply or bad motherboard capacitors. If you have an old motherboard (over 5 years old), check for bulged or burned-out capacitors. Make sure your fans are working and clear of dust.

If your hard drive went offline it could be a failing drive, a bad data cable or a bad power connector. If the drive is S.M.A.R.T.-aware you can check it with smartctl. You can also surface scan with fsck (on ext2/ext3, e2fsck -c runs badblocks over the partition) or, better, download and run the manufacturer's diagnostics for that drive.
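A rough sketch of the smartctl check mentioned above. The function name check_drive is made up for illustration, and the device name you pass in is an assumption about your system (an IDE disk on a machine of this age is likely /dev/hda, but check yours):

```shell
#!/bin/sh
# check_drive DEVICE: print the SMART health verdict and the drive's own error log.
# check_drive is a hypothetical helper name; smartctl comes from the smartmontools package.
check_drive() {
    dev=$1
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found (Debian package: smartmontools)" >&2
        return 127
    fi
    # -H prints the drive's overall health self-assessment.
    smartctl -H "$dev"
    # -l error prints the errors the drive itself has recorded.
    smartctl -l error "$dev"
}
```

Run it as root, for example "check_drive /dev/hda". For a slower but more thorough check, "smartctl -t long" starts the drive's extended self-test and "smartctl -l selftest" shows the result once it has finished.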


So, the solution was to hit the power button and hope it would work. Luckily, it came back up straight away, and at the moment we are working without a problem.

But I still can't say what caused the problem. So, my question is: how do I trace this failure? Log files? Which ones?

If it was caused by a hard drive error then it will show up in /var/log/messages and /var/log/syslog, unless the disk went offline before any errors were logged. If you are ambitious you can use the kernel's core dump driver/module, make a core dump after the crash and examine it with gdb. You probably want to dump to a device other than your root partition drive in case of hard drive problems.
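To search those logs, something like the following may help. It is only a sketch: scan_logs is a made-up name, and the pattern list is a guess at typical disk and kernel error strings, not a complete set, so adjust it to what your kernel actually prints:

```shell
#!/bin/sh
# scan_logs FILE...: print lines that look like disk or kernel trouble.
# The pattern below is illustrative only; extend it as needed.
scan_logs() {
    pattern='I/O error|dma_intr|DriveReady|Oops'
    for f in "$@"; do
        # Skip logs that are missing or not readable by this user.
        [ -r "$f" ] || continue
        # -H prefixes each match with the file name, -n with the line number.
        grep -HnE "$pattern" "$f" || true
    done
}
```

Typical use would be "scan_logs /var/log/messages /var/log/syslog", plus any rotated copies such as /var/log/syslog.0 if the failure happened before the last rotation.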


Any other tests I can run?

I use three kinds of tests, usually at the same time to maximize stress. One is a version of a bash script called "burnit", which was used to test for a K6 bug. It does repeated kernel compiles and runs checksums on the object files, making sure they are the same for each loop. This is the best single test I've found for stress testing PCs.
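This is not the burnit script itself, just a minimal sketch of the same idea: rebuild in a loop and compare object-file checksums against the first pass. The function names are made up, and it assumes the build produces identical objects on every pass; any object that embeds a build time or build counter would have to be excluded from the comparison.

```shell
#!/bin/sh
# sum_objects DIR: checksum every object file under DIR, in a stable order.
sum_objects() {
    find "$1" -type f -name '*.o' -exec md5sum {} + | sort -k 2
}

# stress_build DIR LOOPS CMD...: run CMD in DIR LOOPS times and compare
# the object checksums of each pass against those of the first pass.
stress_build() {
    dir=$1; loops=$2; shift 2
    baseline=
    i=1
    while [ "$i" -le "$loops" ]; do
        ( cd "$dir" && "$@" ) >/dev/null 2>&1 || {
            echo "loop $i: build failed"
            return 1
        }
        current=$(sum_objects "$dir")
        if [ -z "$baseline" ]; then
            baseline=$current          # first pass becomes the reference
        elif [ "$current" != "$baseline" ]; then
            echo "loop $i: object checksums differ"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $loops loops matched"
}
```

For a kernel tree the call might look like "stress_build /usr/src/linux 20 sh -c 'make clean && make'", where both the path and the make targets are assumptions that depend on your kernel version. A failed build or a checksum mismatch on a machine that is otherwise idle points at CPU, memory or cooling.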

Another good test is running debsums on all installed packages. I run it with the -c flag to minimize output, and since I have a local debian archive, I also use the --generate=all option to add LAN traffic.
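A sketch of running the debsums check in a loop, so it keeps the disk busy alongside the other tests. debsums_loop is a made-up name; add --generate=all to the debsums call if you also want the package archives read, as described above.

```shell
#!/bin/sh
# debsums_loop N: run N verification passes over all installed packages.
# debsums -c prints only the files whose checksum does not match.
debsums_loop() {
    passes=$1
    n=1
    while [ "$n" -le "$passes" ]; do
        # debsums exits non-zero when it finds mismatches, so ignore the status
        # and look at the output instead.
        bad=$(debsums -c 2>/dev/null) || true
        if [ -n "$bad" ]; then
            echo "pass $n: mismatches reported:"
            echo "$bad"
        else
            echo "pass $n: clean"
        fi
        n=$((n + 1))
    done
}
```

A file that shows up on every pass is most likely just locally modified. Mismatches that come and go between passes are the ones that suggest a hardware problem.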

The third test is memtest, which I run on the memory which I don't need for the other tests.

I have never had a hardware problem that wasn't revealed by these tests, but some infrequent failures took hours or days to show up.


Please, bear in mind that this is a production firewall system.

Good reason to set up a backup firewall. Any old PC will do. The backup firewall can serve as a replacement while you do the stress testing. After fixing the failing PC, you can keep the backup running and ready to replace the primary firewall at any time.


