
Re: Please help - kernel crashes often



After running memtest86 (V3.3) for at least 24 hours, I came back and
saw that each machine completed 61-63 cycles of tests, with 0
errors...

I also looked through the BIOS for an option to disable the CPU cache,
but there doesn't appear to be one.

I did turn on chipkill and some other supposed ECC memory "helpers"
and instantly had the machine crash twice.

[root@lvs01 ~]# mcelog --k8 --ascii <mce2.txt
CPU 0 4 northbridge TSC 2
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6ca0
       bit32 = err cpu0
       bit45 = uncorrected ecc error
       bit57 = processor context corrupt
       bit61 = error uncorrected
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS b65020016c080813 MCGSTATUS 4
332ff8453 ADDR 7ff5faf0
Kernel panic - not syncing: Machine check

[root@lvs01 ~]# mcelog --k8 --ascii <mce.txt
CPU 0 4 northbridge TSC 34096547a5
RIP 10:ffffffff8010c275
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6ca0
       bit40 = error found by scrub
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS f45021006c080a13 MCGSTATUS 7
RIP: default_idle+0x22/0x25}
Kernel panic - not syncing: Uncorrected machine check
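
As a cross-check on the two decodes above, here is a minimal Python sketch
that pulls the same flags and syndrome out of the raw STATUS words. The bit
labels are copied from the mcelog output above and nothing else is decoded.
The syndrome layout (low byte in bits 54:47, high byte in bits 31:24) is an
assumption about the K8 northbridge register; it does reproduce the 6ca0
that mcelog printed for both values. The name decode_status is made up for
this example.

```python
# Bit labels copied from the mcelog output above; other bits are ignored.
FLAGS = {
    32: "err cpu0",
    40: "error found by scrub",
    45: "uncorrected ecc error",
    57: "processor context corrupt",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def decode_status(status):
    """Return (syndrome, [(bit, label), ...]) for a 64-bit STATUS value."""
    # Assumed K8 chipkill layout: syndrome low byte in bits 54:47,
    # syndrome high byte in bits 31:24.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    flags = [(bit, label) for bit, label in sorted(FLAGS.items())
             if (status >> bit) & 1]
    return syndrome, flags


# The two STATUS values from the crashes above.
for raw in ("b65020016c080813", "f45021006c080a13"):
    syndrome, flags = decode_status(int(raw, 16))
    print(f"STATUS {raw}: syndrome = {syndrome:x}")
    for bit, label in flags:
        print(f"  bit{bit} = {label}")
```

Both crashes carry the same syndrome. The second STATUS has the overflow
bit (62) set and lacks the processor-context-corrupt bit (57), which
matches what mcelog reported.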

I tried running the same workload that I use to make it crash (just a
simple make -j2 on kernel sources) with only 1 DIMM installed at a
time, to see whether either one was to blame; neither failed after a
few minutes. Now I've put them both back (but in the opposite slots)
and so far it's been running. But that is the nature of this issue -
it can happen after 10 minutes or 10 hours... and I can't have that!
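
Since the failure can take anywhere from minutes to hours, a few minutes
per DIMM may not be long enough to clear either one. A rough Python sketch
of a longer-running loop follows. It assumes the build is restartable, and
run_until_failure and the log file name are made up for this example. Each
run is recorded to disk with fsync before it starts, so the log should
still show how far the machine got after a panic and reboot.

```python
import os
import subprocess
import sys
import time


def run_until_failure(cmd, log_path, max_runs=1000, cwd=None):
    """Run cmd repeatedly; return how many runs exited with status 0."""
    ok = 0
    with open(log_path, "a") as log:
        for i in range(1, max_runs + 1):
            # Write the marker and force it to disk before starting the
            # run, so it is not lost if the machine panics mid-build.
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} starting run {i}\n")
            log.flush()
            os.fsync(log.fileno())
            result = subprocess.run(cmd, cwd=cwd)
            if result.returncode != 0:
                log.write(f"run {i} exited with status {result.returncode}\n")
                break
            ok += 1
    return ok
```

For example, run_until_failure(["sh", "-c", "make clean && make -j2"],
"build-loop.log", cwd=...) with cwd pointing at the kernel source tree;
the make clean is there so each pass actually rebuilds.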


