How to interpret a machine check exception
Hi,
One of our Opteron-based machines here at work keeps crashing (with
kernel 2.6.8/2.6.10). The last thing it prints on the console is
(transcribed by hand):
CPU0: Machine Check Exception 4 Bank 0: b60ea00000000833
TSC 6e5cd030ae71
ADDR 258f8640
I've downloaded parsemce.c 0.0.9 from http://codemonkey.org.uk/cruft/.
But I'm not sure about the correct way to call it (or if it even works
for amd64).
My guess would be the command line below. Is that correct? And if
yes, does it mean that I have faulty RAM? Can machine check exceptions
be triggered by faulty software (i.e. kernel bugs), or are they always
a sign of bad hardware?
Thanks in advance for any help,
- Ralf
$ ./parsemce -b 0 -s b60ea00000000833 -e 4 -a 258f8640 -V
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): b60ea00000000833 @ 258f8640
External tag parity error
Uncorrectable ECC error
CPU state corrupt. Restart not possible
Address in addr register valid
Error enabled in control register
Error not corrected.
Bus and interconnect error
Participation: Local processor originated request
Timeout: Request did not timeout
Request: Generic error
Transaction type : Instruction
Memory/IO : Other
parsemce version 0.0.9
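
As a cross-check on the tool output above, here is a minimal Python sketch that decodes only the generic flag bits of the two status values by hand. It assumes the architectural x86 layout (bits 63..57 of the bank status are VAL, OVER, UC, EN, MISCV, ADDRV, PCC; bits 0..2 of the global status are RIPV, EIPV, MCIP). The function and table names are made up for illustration and are not part of parsemce, and the model-specific bits in between are left undecoded.

```python
# Assumed bit positions (architectural x86 machine check layout).
MCI_FLAGS = {
    63: "VAL",    # status register contents valid
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error not corrected
    60: "EN",     # error reporting enabled in control register
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, table):
    """Return the names of all flags set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a 64-bit bank status word into flags and error codes."""
    return {
        "flags": set_flags(status, MCI_FLAGS),
        "mca_error_code": status & 0xFFFF,              # bits 15..0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
    }


if __name__ == "__main__":
    bank = decode_bank_status(0xB60EA00000000833)
    print("bank flags:    ", bank["flags"])
    print("MCA error code: 0x%04x" % bank["mca_error_code"])
    print("global flags:  ", set_flags(0x4, MCG_FLAGS))
```

Under these assumptions the bank status has VAL, UC, EN, ADDRV and PCC set, and the global status 4 has only MCIP set (RIPV clear), which agrees with the "Error not corrected", "Address in addr register valid", "CPU state corrupt" and "Restart IP invalid" lines that parsemce prints.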