
How to interpret a machine check exception



Hi,

One of our Opteron-based machines here at work keeps crashing (with kernels 2.6.8 and 2.6.10). The last thing it prints on the console is (transcribed by hand):

CPU0: Machine Check Exception       4 Bank 0: b60ea00000000833
TSC 6e5cd030ae71
ADDR 258f8640

I've downloaded parsemce.c 0.0.9 from http://codemonkey.org.uk/cruft/, but I'm not sure about the correct way to call it (or whether it even works on amd64). My guess is the command line below. Is that correct? And if so, does the output mean that I have faulty RAM? Can machine check exceptions be triggered by faulty software (i.e. kernel bugs), or are they a sign of bad hardware?

Thanks in advance for any help,

- Ralf


$ ./parsemce -b 0 -s b60ea00000000833 -e 4 -a 258f8640 -V
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): b60ea00000000833 @ 258f8640
       External tag parity error
       Uncorrectable ECC error
       CPU state corrupt. Restart not possible
       Address in addr register valid
       Error enabled in control register
       Error not corrected.
       Bus and interconnect error
       Participation: Local processor originated request
       Timeout: Request did not timeout
       Request: Generic error
       Transaction type : Instruction
       Memory/IO : Other
parsemce version 0.0.9
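
As a cross-check on the flag lines in that output, the top bits of the two status words can also be decoded by hand. Here is a minimal sketch in Python. It assumes the architectural x86 machine-check layout (bank status bits 63..57 are VAL, OVER, UC, EN, MISCV, ADDRV, PCC; global status bits 0..2 are RIPV, EIPV, MCIP). The helper name decode_flags is made up for illustration and is not part of parsemce. It does not try to interpret the vendor-specific middle bits or the low 16-bit error code.

```python
# Assumed bit layout of a per-bank MCi_STATUS register (architectural bits only).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting enabled in the control register
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt, restart not possible
}

# Assumed bit layout of the global MCG_STATUS register.
MCG_STATUS_FLAGS = {
    0: "RIPV",    # restart IP valid
    1: "EIPV",    # error IP valid
    2: "MCIP",    # machine check in progress
}

def decode_flags(value, table):
    """Return the names of all flags in `table` that are set in `value`,
    highest bit first."""
    return [name
            for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]

# Values taken from the console message quoted in this post.
bank_status = 0xb60ea00000000833
global_status = 0x4

bank_flags = decode_flags(bank_status, MCI_STATUS_FLAGS)
global_flags = decode_flags(global_status, MCG_STATUS_FLAGS)
error_code = bank_status & 0xFFFF  # low 16 bits: the error code field

print("bank 0 flags:", bank_flags)
print("global flags:", global_flags)
print("error code:   0x%04x" % error_code)
```

Under those assumptions the set bank flags are VAL, UC, EN, ADDRV and PCC, which agrees with the "Error not corrected", "Error enabled in control register", "Address in addr register valid" and "CPU state corrupt" lines above. The global value 4 has only MCIP set and RIPV clear, matching "Machine Check in progress" and "Restart IP invalid".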


