machine checks on Dell R815 under jessie
I upgraded four Dell R815s from wheezy to jessie a few weeks ago. Before the
upgrade, they had run reliably for about five years. Since the upgrade, two of
the machines have been getting sporadic machine checks. The machines boot fine
and run for a day or more before a machine check occurs. I have not found a
correlation with anything in particular.
The front panel on the first machine said the machine check was on CPU #4. The
front panel on the second machine said its first machine check was on CPU #1
and its second machine check was on CPU #2.
I doubt that this is really a hardware problem: three CPUs began exhibiting
machine checks within a few weeks of each other, all immediately after the
upgrade from wheezy to jessie and after five years of reliable operation.
Has anybody else encountered this issue? Any suggestions on how to debug and
fix it?
Thanks,
Jeff (http://engineering.purdue.edu/~qobi)
-------------------------------------------------------------------------------
root@arivu:~# ipmitool sel elist
1 | 08/05/2016 | 00:12:47 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
2 | 08/06/2016 | 11:35:17 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
3 | 08/06/2016 | 11:35:17 | Unknown #0x28 | | Asserted
4 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
5 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
6 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
7 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
8 | 08/06/2016 | 11:35:19 | Unknown #0x28 | | Asserted
9 | 08/06/2016 | 11:35:19 | Unknown #0x28 | | Asserted
a | 08/06/2016 | 11:35:19 | Unknown #0x28 | | Asserted
root@arivu:~#
root@perisikan:~# ipmitool sel elist
[...]
1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
1d | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
1e | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
1f | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
20 | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
21 | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
22 | 08/08/2016 | 12:23:04 | Unknown #0x28 | | Asserted
23 | 08/08/2016 | 12:23:04 | Unknown #0x28 | | Asserted
24 | 08/08/2016 | 12:23:04 | Unknown #0x28 | | Asserted
25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
26 | 08/09/2016 | 18:37:46 | Unknown #0x28 | | Asserted
27 | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
28 | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
29 | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
2a | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
2b | 08/09/2016 | 18:37:48 | Unknown #0x28 | | Asserted
2c | 08/09/2016 | 18:37:48 | Unknown #0x28 | | Asserted
2d | 08/09/2016 | 18:37:48 | Unknown #0x28 | | Asserted
root@perisikan:~#
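
To summarize logs like the ones above, here is a minimal Python sketch that
counts machine check events per day. It assumes the pipe-separated layout
shown in the `ipmitool sel elist` output above and treats the record ID as
hexadecimal (the IDs above run 9, a, ... 1c, 1d). The function names
`parse_sel` and `machine_checks_per_day` are hypothetical helpers, not part of
ipmitool, and the sample text is just a few lines copied from the second log.

```python
from collections import Counter


def parse_sel(text):
    """Parse pasted `ipmitool sel elist` output into
    (record_id, date, time, sensor, event) tuples."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6:
            continue  # skip shell prompts, "[...]" and blank lines
        rec_id, date, time, sensor, event = fields[:5]
        # Record IDs are assumed to be hexadecimal, as in the logs above.
        entries.append((int(rec_id, 16), date, time, sensor, event))
    return entries


def machine_checks_per_day(entries):
    """Count entries whose sensor column mentions a CPU machine check."""
    return Counter(
        date for _, date, _, sensor, _ in entries if "Machine Chk" in sensor
    )


sample = """\
root@perisikan:~# ipmitool sel elist
1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
1d | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
"""

entries = parse_sel(sample)
for date, count in sorted(machine_checks_per_day(entries).items()):
    print(date, count)
```

Running the same function over the full log from each machine would give a
per-day count that can be compared against cron jobs, load, or other activity.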