
Re: Please help - kernel crashes often



[root@lvs01 ~]# mcelog --k8 --ascii <mce3.txt
CPU 0 4 northbridge TSC 23f6fd4262e9
RIP 00:413bd600000000
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 3d3c
       bit40 = error found by scrub
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS f41e21003d080a13 MCGSTATUS 7
Kernel panic - not syncing: Uncorrected machine check

And again :)

This is with chipkill and every other BIOS option that should help debug or correct RAM errors enabled.

It took a while to crash, but eventually it did.

I'm going to attack both RAM and the cooling.
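
For anyone comparing the two dumps, here is a rough Python sketch that pulls the same flags out of the raw STATUS values by hand. The bit names are copied from the mcelog output in this thread. The syndrome layout (bits 31-24 as the high byte, bits 54-47 as the low byte) is my assumption; it reproduces both syndromes mcelog printed, but I have not checked it against the AMD documentation. decode_status is just a helper name made up for this example.

```python
# Bit names are copied from the mcelog output in this thread.
FLAGS = {
    32: "err cpu0",
    40: "error found by scrub",
    45: "uncorrected ecc error",
    57: "processor context corrupt",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def decode_status(status):
    """Return (chipkill syndrome, [(bit, name), ...]) for a raw STATUS value."""
    flags = [(bit, name) for bit, name in sorted(FLAGS.items())
             if (status >> bit) & 1]
    # Assumed layout: high byte from bits 31-24, low byte from bits 54-47.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return syndrome, flags


# The two STATUS values from the mce3.txt and mce2.txt dumps.
for raw in ("f41e21003d080a13", "b65020016c080813"):
    syndrome, flags = decode_status(int(raw, 16))
    print(f"STATUS {raw}: chipkill syndrome = {syndrome:04x}")
    for bit, name in flags:
        print(f"       bit{bit} = {name}")
```

The difference between the two crashes shows up directly: the newer one has bit40 (found by the scrubber) and bit62 (overflow) set, while the earlier one has bit32 and bit57 instead.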


On 2/1/06, mike <mike503@gmail.com> wrote:
> Yeah, I looked at the memory. It's got a PQI sticker (at least one set)
>
> It's SUPPOSED to be "(Samsung, Micron, Elpida, Infineon, Hynix OEM)" -
> which would align it basically with what Supermicro suggests:
>
> http://supermicro.com/Aplus/support/resources/memory/?sz=1.0&mspd=0.4&mtyp=9&id=51EF70624CA791283EC434A52DA0D4E2
>
> Anyway, I called Supermicro. I'm going to order their
> recommended/proper heatsink, air shroud, and then also call up the
> vendor I got the RAM from and tell them they did not deliver the
> proper stuff. They'll put up a fight, because they don't do good
> business - so tomorrow looks to be fun.
>
> Hopefully between those two any cooling and any RAM issues will be out
> of the equation.
>
> On 2/1/06, Paul Brook <paul@codesourcery.com> wrote:
> > On Wednesday 01 February 2006 16:47, mike wrote:
> > > After running memtest86 (V3.3) for at least 24 hours, I came back and
> > > saw that each machine completed 61-63 cycles of tests, with 0
> > > errors...
> > >
> > > However, I did look through the BIOS for cache disabling - and it
> > > doesn't appear I can disable the CPU cache.
> > >
> > > I did turn on chipkill and some other supposed ECC memory "helpers"
> > > and instantly had the machine crash twice.
> > >
> > > [root@lvs01 ~]# mcelog --k8 --ascii <mce2.txt
> > > CPU 0 4 northbridge TSC 2
> > >   Northbridge Chipkill ECC error
> > >   Chipkill ECC syndrome = 6ca0
> > >        bit32 = err cpu0
> > >        bit45 = uncorrected ecc error
> > >        bit57 = processor context corrupt
> > >        bit61 = error uncorrected
> > >   bus error 'local node origin, request didn't time out
> > >       generic read mem transaction
> > >       memory access, level generic'
> > > STATUS b65020016c080813 MCGSTATUS 4
> > > 332ff8453 ADDR 7ff5faf0
> > > Kernel panic - not syncing: Machine check
> >
> > I had something similar, and it turned out the motherboard just didn't like
> > the brand/model of memory I was using. Replacing it with a different make
> > (this time one that was on the motherboard's recommended list) fixed the
> > problem.
> >
> > Paul
> >
>


