Re: Please help - kernel crashes often
[root@lvs01 ~]# mcelog --k8 --ascii <mce3.txt
CPU 0 4 northbridge TSC 23f6fd4262e9
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 3d3c
bit40 = error found by scrub
bit45 = uncorrected ecc error
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS f41e21003d080a13 MCGSTATUS 7
Kernel panic - not syncing: Uncorrected machine check
And again :)
This is with chipkill and every other BIOS option that should help debug or correct RAM errors enabled.
It took a while to crash, but eventually it did.
I'm going to attack both RAM and the cooling.
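
For anyone who wants to cross-check the two STATUS values in this thread by hand, here is a rough Python sketch, not anything taken from mcelog itself. The bit labels are copied from the mcelog output above and below; bits mcelog did not name in these dumps are simply ignored. The syndrome layout (low byte in bits 54:47, high byte in bits 31:24) is my assumption about the K8 northbridge status register, and decode_status is a made-up helper name.

```python
# Bit labels copied from the mcelog output in this thread; bits that
# mcelog did not name in these two dumps are left undecoded.
BIT_NAMES = {
    32: "err cpu0",
    40: "error found by scrub",
    45: "uncorrected ecc error",
    57: "processor context corrupt",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def decode_status(status: int) -> dict:
    """Split a northbridge STATUS value into named bits and ECC syndrome."""
    flags = [name for bit, name in sorted(BIT_NAMES.items())
             if (status >> bit) & 1]
    # Assumed layout: syndrome[7:0] sits in bits 54:47 and
    # syndrome[15:8] sits in bits 31:24 of the 64-bit value.
    low_byte = (status >> 47) & 0xFF
    high_byte = (status >> 24) & 0xFF
    return {"flags": flags, "syndrome": (high_byte << 8) | low_byte}


if __name__ == "__main__":
    # The two STATUS values pasted in this thread.
    for raw in ("f41e21003d080a13", "b65020016c080813"):
        decoded = decode_status(int(raw, 16))
        print(raw, format(decoded["syndrome"], "04x"), decoded["flags"])
```

With that assumed layout, the two values give syndromes 3d3c and 6ca0 and the same bit lists mcelog printed, so at least the dumps are internally consistent.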
On 2/1/06, mike <firstname.lastname@example.org> wrote:
> Yeah, I looked at the memory. It's got a PQI sticker (at least one set).
> It's SUPPOSED to be "(Samsung, Micron, Elpida, Infineon, Hynix OEM)" -
> which would align it basically with what Supermicro suggests:
> Anyway, I called Supermicro. I'm going to order their
> recommended/proper heatsink, air shroud, and then also call up the
> vendor I got the RAM from and tell them they did not deliver the
> proper stuff. They'll put up a fight, because they don't do good
> business - so tomorrow looks to be fun.
> Hopefully between those two any cooling and any RAM issues will be out
> of the equation.
> On 2/1/06, Paul Brook <email@example.com> wrote:
> > On Wednesday 01 February 2006 16:47, mike wrote:
> > > After running memtest86 (V3.3) for at least 24 hours, I came back and
> > > saw that each machine completed 61-63 cycles of tests, with 0
> > > errors...
> > >
> > > However, I did look through the BIOS for cache disabling - and it
> > > doesn't appear I can disable the CPU cache.
> > >
> > > I did turn on chipkill and some other supposed ECC memory "helpers"
> > > and instantly had the machine crash twice.
> > >
> > > [root@lvs01 ~]# mcelog --k8 --ascii <mce2.txt
> > > CPU 0 4 northbridge TSC 2
> > > Northbridge Chipkill ECC error
> > > Chipkill ECC syndrome = 6ca0
> > > bit32 = err cpu0
> > > bit45 = uncorrected ecc error
> > > bit57 = processor context corrupt
> > > bit61 = error uncorrected
> > > bus error 'local node origin, request didn't time out
> > > generic read mem transaction
> > > memory access, level generic'
> > > STATUS b65020016c080813 MCGSTATUS 4
> > > 332ff8453 ADDR 7ff5faf0
> > > Kernel panic - not syncing: Machine check
> > I had something similar, and it turned out the motherboard just didn't like
> > the brand/model of memory I was using. Replacing it with a different make
> > (this time one that was on the motherboard's recommended list) fixed the
> > problem.
> > Paul