
Re: Please help - kernel crashes often



On Wed, Feb 01, 2006 at 08:47:30AM -0800, mike wrote:
> After running memtest86 (V3.3) for at least 24 hours, I came back and
> saw that each machine completed 61-63 cycles of tests, with 0
> errors...
> 
> However, I did look through the BIOS for cache disabling - and it
> doesn't appear I can disable the CPU cache.
> 
> I did turn on chipkill and some other supposed ECC memory "helpers"
> and instantly had the machine crash twice.
> 
> [root@lvs01 ~]# mcelog --k8 --ascii <mce2.txt
> CPU 0 4 northbridge TSC 2
>   Northbridge Chipkill ECC error
>   Chipkill ECC syndrome = 6ca0
>        bit32 = err cpu0
>        bit45 = uncorrected ecc error
>        bit57 = processor context corrupt
>        bit61 = error uncorrected
>   bus error 'local node origin, request didn't time out
>       generic read mem transaction
>       memory access, level generic'
> STATUS b65020016c080813 MCGSTATUS 4
> 332ff8453 ADDR 7ff5faf0
> Kernel panic - not syncing: Machine check

A quick Google search for "Northbridge Chipkill ECC error" turned up
this interesting thread:
http://lkml.org/lkml/2006/1/12/385

It certainly appears to point at bad memory, or potentially an
overheating problem.  The chipkill feature reports when ECC finds an
error in memory and has to correct it.  In your case it even seems to be
saying that it found a memory error that it couldn't correct.
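
If you want to check that reading yourself, here is a rough Python
sketch that pulls the same flags out of the STATUS value in your dump.
The bit labels are copied from your mcelog output.  The syndrome layout
(low byte in bits 54:47, high byte in bits 31:24) is my assumption about
the K8 northbridge register, although it does reproduce the 6ca0 that
mcelog printed.  The function names are made up for illustration.

```python
# Flag names and bit numbers are copied from the quoted mcelog output;
# this is not a complete description of the register.
FLAGS = {
    32: "err cpu0",
    45: "uncorrected ecc error",
    57: "processor context corrupt",
    61: "error uncorrected",
}


def decode_status(status):
    """Return the labelled flags that are set in a 64-bit STATUS value."""
    return [
        f"bit{bit} = {name}"
        for bit, name in sorted(FLAGS.items())
        if (status >> bit) & 1
    ]


def chipkill_syndrome(status):
    """Assemble the 16-bit chipkill syndrome.

    Assumption: low byte lives in bits 54:47 and high byte in bits 31:24.
    """
    low = (status >> 47) & 0xFF
    high = (status >> 24) & 0xFF
    return (high << 8) | low


if __name__ == "__main__":
    status = 0xB65020016C080813  # STATUS from the quoted dump
    for line in decode_status(status):
        print(line)
    print(f"Chipkill ECC syndrome = {chipkill_syndrome(status):04x}")
```

Bits 45 and 61 both being set is what makes me say the error was
uncorrected rather than a corrected single-symbol error.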

You might have to try running with half the memory for a number of hours
to see if it fails, given it could be heat related.  Of course memtest
doesn't really stress the CPU, disk or video, so the machine generates a
lot less heat while running it.  Kernel compiles are a good way to heat
things up, though. :)

Len Sorensen


