
Re: Please help - kernel crashes often

On Tue, Jan 31, 2006 at 01:45:01AM -0800, mike wrote:
> Yes, I was able to go down and get on the console, record it, and
> found a thread on how to decipher it.
> The MCE was:
> CPU 0: Machine Check Exception:         4 Bank 0: f60da00000000833
> TSC 23fd7acec1e ADDR 797db2c0
> Kernel panic - not syncing: Machine check
> the output from "mcelog" was:
> web03:~# mcelog --k8 --ascii <mce.txt
> CPU 0 0 data cache TSC 23fd7acec1e
>   Data cache ECC error (syndrome 1b)
>        bit45 = uncorrected ecc error
>        bit57 = processor context corrupt
>        bit61 = error uncorrected
>        bit62 = error overflow (multiple errors)
>   bus error 'local node origin, request didn't time out
>       data read mem transaction
>       memory access, level generic'
> STATUS f60da00000000833 MCGSTATUS 4
> Kernel panic - not syncing: Machine check
> I've been running memtest86 V3.3 (if I recall the exact title) on all
> the machines starting earlier today and will be looking at them in the
> next day or two to figure out what they say.
> One thing that disturbs me is that it shows ECC: no in memtest, even
> when I force enable it on - and the RAM is most definitely ECC...

It seems to say _data_cache_ ECC error, not RAM ECC error.  Sounds like
the CPU has a defective cache, unless that is also how the CPU reports
an ECC failure on a transfer between cache and RAM.  I think it needs
an expert on the K8 to answer that. :)
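
As a rough cross-check of what mcelog printed, the STATUS word can be
decoded by hand.  The sketch below is only that: the bit positions are
assumptions taken from the mcelog output quoted above plus the usual x86
machine-check status layout, and the names FLAGS and decode_status are
made up for illustration.  Verify the layout against the K8
documentation before relying on it.

```python
# Sketch: decode the machine-check STATUS word from the panic message.
# Bit positions are ASSUMPTIONS (from the quoted mcelog output and the
# usual x86 MCi_STATUS layout), not an authoritative K8 reference.

STATUS = 0xF60DA00000000833  # value from the "Bank 0:" line

# Hypothetical lookup table: bit number -> meaning.
FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "error uncorrected",
    60: "error reporting enabled",
    58: "address valid",
    57: "processor context corrupt",
    46: "corrected ecc error",
    45: "uncorrected ecc error",
}


def decode_status(status):
    """Return (names of set flags, ECC syndrome, low 16-bit error code)."""
    set_flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    # Assumed: the ECC syndrome occupies bits 54..47 of the status word.
    syndrome = (status >> 47) & 0xFF
    # The low 16 bits hold the error code that mcelog reports as "bus error".
    error_code = status & 0xFFFF
    return set_flags, syndrome, error_code


flags, syndrome, error_code = decode_status(STATUS)
for name in flags:
    print(name)
print("syndrome %x" % syndrome)
print("error code %04x" % error_code)
```

For the STATUS value above this gives syndrome 1b and error code 0833,
with the uncorrected-ECC and context-corrupt bits set, which agrees with
what mcelog reported.  It does not settle whether the bad data
originated in the cache itself or arrived from RAM.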

1U machines do tend to run hot, since everything is packed in tightly
and there is little room for cooling.  A marginal CPU is much more
likely to fail in such conditions.  If the BIOS has an option to disable
the cache, you could see whether that makes the machine stable.  If it
does, then you pretty much know where the broken hardware is.

Len Sorensen
