
Re: EDAC errors: false positives or broken RAM? (i5000)

On Tue, 16 Aug 2011, Jonas Meurer wrote:
> Now, the messages keep spamming my log and console for more than three
> weeks already. On some days I get more than 36000 errors a day.
> It's notable that every DRAM bank from 0 to 7 is affected.
> Now I wonder, whether these are false positives (searching for the
> errors in the web revealed that these are quite common), or whether my
> RAM might be damaged.

Memory errors _ARE_ a lot more common than people would like them to be.
False positives are not common at all, but yes, a kernel bug could misdetect
errors, and you _are_ running an ancient kernel.
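One way to judge how bad it really is: tally the EDAC messages per day and
per memory controller. What follows is only a minimal sketch in Python,
assuming syslog lines in the format quoted further down in this message
("Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: ..."). The name tally_edac
is made up for illustration, and the pattern may need adjusting for other
syslog formats.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt quoted in this thread:
#   Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!!
EDAC_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"\S+\s+kernel:\s+EDAC\s+\S+\s+(?P<mc>MC\d+):"
)


def tally_edac(lines):
    """Count EDAC messages per (day, memory controller).

    `lines` is any iterable of syslog lines; non-EDAC lines are ignored.
    """
    counts = Counter()
    for line in lines:
        m = EDAC_LINE.match(line)
        if m:
            day = f"{m.group('month')} {int(m.group('day'))}"
            counts[(day, m.group("mc"))] += 1
    return counts
```

Feed it an open log file (for example the kernel log your syslog daemon
writes) and print the sorted items. A count that keeps climbing day after
day is a better argument for scheduling repair than any single message.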

> Unfortunately, running memtest86+ is not an option, as the server in
> question is a production server, and I don't have a second server for
> redundancy.

You'd better rethink that.  Your only server is spewing memory errors right
and left, you don't have a second server, and you don't want to schedule
emergency repair?

> /proc/sysrq-trigger' do stop the logging flood to console. Did I miss
> anything, or is it simply impossible to stop console logging for this
> kind of kernel error messages. That would be very unfortunate.

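dmesg -n 1 can do it: it lowers the console log level so that only emergency
messages reach the console.  (dmesg -c merely clears the kernel ring buffer.)
The messages still end up in the ring buffer and in syslog.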
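The console log level is also exposed in /proc/sys/kernel/printk, which holds
four integers: current console level, default message level, minimum console
level, and default console level. A minimal sketch in Python, assuming the
goal is only to quieten the console (the messages still reach the ring
buffer and syslog). The function names are made up for illustration, and
writing the real file needs root.

```python
from pathlib import Path

PRINTK = Path("/proc/sys/kernel/printk")


def read_printk(path=PRINTK):
    """Return the four printk levels as a list of ints.

    Order: console, default message, minimum console, default console.
    """
    return [int(v) for v in Path(path).read_text().split()]


def set_console_loglevel(level, path=PRINTK):
    """Rewrite the first field (console log level), keeping the other three.

    Only messages with a priority number below `level` reach the console,
    so level 1 leaves just the emergency messages.
    """
    levels = read_printk(path)
    levels[0] = level
    Path(path).write_text(" ".join(str(v) for v in levels) + "\n")
    return levels
```

Calling set_console_loglevel(1) as root should have the same effect on the
console as lowering the level with dmesg. It does not survive a reboot.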

> Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!!
> 1st FATAL Err Reg= 0x4
> Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with
> intelligent throttling disabled

Well, it is complaining that the memory modules are overheating, and that your
BIOS has not programmed the chipset to throttle them when they overheat.
I've never seen any of the i5000-based servers around here doing that.

> Handle 0x003A, DMI type 17, 27 bytes
> Memory Device
> 	Array Handle: 0x0038
> 	Error Information Handle: Not Provided
> 	Total Width: 72 bits
> 	Data Width: 64 bits
> 	Size: 4096 MB
> 	Form Factor: FB-DIMM
> 	Set: 1
> 	Locator: ONBOARD DIMM_A1
> 	Bank Locator: Channel A
> 	Type: DDR2 FB-DIMM
> 	Type Detail: Synchronous
> 	Speed: 667 MHz (1.5 ns)
> 	Manufacturer: 8551
> 	Serial Number: 02028121
> 	Asset Tag: Not Specified
> 	Part Number: 72T512920EFA3SC

FB-DIMMs are nasty heat sources.  If they're not being cooled properly, they
WILL get damaged.

  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
