
EDAC errors: false positives or broken RAM? (i5000)



Hey,

I've been getting tons of EDAC error messages from the kernel lately on
an i5000 server with 16GB RAM (4 x 4GB modules). The server has been
running for about three years.

The system is Debian Lenny with a 2.6.26 kernel, self-compiled from
linux-source-2.6.26 2.6.26-26lenny3.

This is not the first time these EDAC error messages have appeared.
In fact, over the last three years I have seen them every now and
then. Sometimes only a few errors were logged, sometimes my logs were
spammed with them for several days, but then they stopped again.

Now the messages have been spamming my log and console for more than
three weeks. On some days I get more than 36000 errors.

It's notable that every DRAM-Bank from 0 to 7 is affected.

Now I wonder whether these are false positives (searching the web for
the errors revealed that these are quite common) or whether my RAM
might be damaged.

Unfortunately, running memtest86+ is not an option, as the server in
question is a production server, and I don't have a second server for
redundancy.

Additionally, a slightly related question: how do I turn off the logging
of these messages to the console? It's impossible to work in an SSH
session when the console is spammed with these logs. Neither setting
kernel.printk, nor 'setterm -msg 0', 'dmesg -n1' or 'echo 1 >
/proc/sysrq-trigger' stops the logging flood to the console. Did I miss
anything, or is it simply impossible to stop console logging for this
kind of kernel error message? That would be very unfortunate.
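
For reference, kernel.printk holds four integers (console_loglevel,
default_message_loglevel, minimum_console_loglevel,
default_console_loglevel), and the kernel prints a message to the console
only if its level is numerically lower than console_loglevel. The sketch
below is a minimal illustration of that rule, not a diagnosis of this
server: the helper names parse_printk and reaches_console and the sample
value "1 4 1 7" are assumptions made up for the example.

```python
from pathlib import Path
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    """The four fields of /proc/sys/kernel/printk, in order."""
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text: str) -> PrintkLevels:
    """Parse the whitespace-separated content of kernel.printk."""
    return PrintkLevels(*(int(field) for field in text.split()))


def reaches_console(message_level: int, levels: PrintkLevels) -> bool:
    """A message is shown on the console if its level is below console_loglevel."""
    return message_level < levels.console_loglevel


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/printk")
    # Fall back to a made-up sample when the file is not available.
    text = path.read_text() if path.exists() else "1 4 1 7"
    levels = parse_printk(text)
    for level, name in [(0, "KERN_EMERG"), (3, "KERN_ERR"), (6, "KERN_INFO")]:
        print(name, "reaches console:", reaches_console(level, levels))
```

Under this rule, a message at level 0 (KERN_EMERG) still passes the
comparison when console_loglevel is 1, which is the value 'dmesg -n1'
sets.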

I have already considered recompiling the kernel without the EDAC i5000
driver in order to stop this annoyance, but I would prefer to fix the
cause instead of fighting the symptoms.

Here's an example error message:

Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!!
1st FATAL Err Reg= 0x4
Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with
intelligent throttling disabled
Aug 16 13:08:20 nibbler kernel: EDAC MC0: UE row 1, channel-a= 0
channel-b= 1 labels "-": (Branch=0 DRAM-Bank=6 RDWR=Read RAS=14214 CAS=0
FATAL Err=0x4)
Aug 16 13:08:22 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!!
1st FATAL Err Reg= 0x4
Aug 16 13:08:22 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with
intelligent throttling disabled
Aug 16 13:08:22 nibbler kernel: EDAC MC0: UE row 0, channel-a= 0
channel-b= 1 labels "-": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=20 CAS=0
FATAL Err=0x4)
Aug 16 13:08:24 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!!
1st FATAL Err Reg= 0x4
Aug 16 13:08:24 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with
intelligent throttling disabled
Aug 16 13:08:24 nibbler kernel: EDAC MC0: UE row 1, channel-a= 0
channel-b= 1 labels "-": (Branch=0 DRAM-Bank=1 RDWR=Read RAS=3268 CAS=0
FATAL Err=0x4)
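
To put numbers on the "errors per day" and "every DRAM-Bank" observations,
the "UE row" lines can be tallied with a short script. This is a minimal
sketch, assuming each message sits on a single line in the log file (the
lines above are wrapped by the mail client) and has exactly the layout
shown; the function name tally and the SAMPLE list are made up for
illustration.

```python
import re
from collections import Counter

# Matches the "UE row" lines of the i5000 driver in the layout quoted above,
# with each message on one line.
UE_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2}) \S+ \S+ kernel: "
    r"EDAC MC\d+: UE row \d+,.*?Branch=\d+ DRAM-Bank=(?P<bank>\d+)"
)


def tally(lines):
    """Count UE messages per (month, day) and per DRAM bank."""
    per_day = Counter()
    per_bank = Counter()
    for line in lines:
        match = UE_LINE.search(line)
        if match is None:
            continue  # not a UE line, e.g. the "FATAL ERRORS Found" header
        per_day[(match.group("month"), int(match.group("day")))] += 1
        per_bank[int(match.group("bank"))] += 1
    return per_day, per_bank


# The three UE lines from the example above, unwrapped.
SAMPLE = [
    'Aug 16 13:08:20 nibbler kernel: EDAC MC0: UE row 1, channel-a= 0 '
    'channel-b= 1 labels "-": (Branch=0 DRAM-Bank=6 RDWR=Read RAS=14214 '
    'CAS=0 FATAL Err=0x4)',
    'Aug 16 13:08:22 nibbler kernel: EDAC MC0: UE row 0, channel-a= 0 '
    'channel-b= 1 labels "-": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=20 '
    'CAS=0 FATAL Err=0x4)',
    'Aug 16 13:08:24 nibbler kernel: EDAC MC0: UE row 1, channel-a= 0 '
    'channel-b= 1 labels "-": (Branch=0 DRAM-Bank=1 RDWR=Read RAS=3268 '
    'CAS=0 FATAL Err=0x4)',
]

if __name__ == "__main__":
    days, banks = tally(SAMPLE)
    print("per day:", dict(days))
    print("per bank:", dict(sorted(banks.items())))
```

To run it on a real log, pass an open file object to tally() instead of
SAMPLE; the file is then read line by line.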

This is what the EDAC module logged at my last reboot:

Jun 27 00:10:29 nibbler kernel: EDAC MC: Ver: 2.1.0 Jun 23 2011
Jun 27 00:10:29 nibbler kernel: EDAC MC0: Giving out device to
'i5000_edac.c' 'I5000': DEV 0000:00:10.0
Jun 27 00:10:29 nibbler kernel: EDAC PCI0: Giving out device to module
'i5000_edac' controller 'EDAC PCI controller': DEV '0000:00:10.0' (POLLED)

And last but not least the output of 'dmidecode -t memory':

# dmidecode 2.9
SMBIOS 2.5 present.

Handle 0x0038, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 32 GB
	Error Information Handle: Not Provided
	Number Of Devices: 8

Handle 0x003A, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 4096 MB
	Form Factor: FB-DIMM
	Set: 1
	Locator: ONBOARD DIMM_A1
	Bank Locator: Channel A
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: 667 MHz (1.5 ns)
	Manufacturer: 8551
	Serial Number: 02028121
	Asset Tag: Not Specified
	Part Number: 72T512920EFA3SC

Handle 0x003C, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: FB-DIMM
	Set: 2
	Locator: ONBOARD DIMM_A2
	Bank Locator: Channel A
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: MemUndefined
	Serial Number: MemUndefined
	Asset Tag: Not Specified
	Part Number: MemUndefined

Handle 0x003D, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 4096 MB
	Form Factor: FB-DIMM
	Set: 1
	Locator: ONBOARD DIMM_B1
	Bank Locator: Channel B
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: 667 MHz (1.5 ns)
	Manufacturer: 8551
	Serial Number: 02027215
	Asset Tag: Not Specified
	Part Number: 72T512920EFA3SC

Handle 0x003F, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: FB-DIMM
	Set: 2
	Locator: ONBOARD DIMM_B2
	Bank Locator: Channel B
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: MemUndefined
	Serial Number: MemUndefined
	Asset Tag: Not Specified
	Part Number: MemUndefined

Handle 0x0040, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 4096 MB
	Form Factor: FB-DIMM
	Set: 5
	Locator: ONBOARD DIMM_C1
	Bank Locator: Channel C
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: 667 MHz (1.5 ns)
	Manufacturer: 8551
	Serial Number: 02027112
	Asset Tag: Not Specified
	Part Number: 72T512920EFA3SC

Handle 0x0042, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: FB-DIMM
	Set: 6
	Locator: ONBOARD DIMM_C2
	Bank Locator: Channel C
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: MemUndefined
	Serial Number: MemUndefined
	Asset Tag: Not Specified
	Part Number: MemUndefined

Handle 0x0043, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 4096 MB
	Form Factor: FB-DIMM
	Set: 5
	Locator: ONBOARD DIMM_D1
	Bank Locator: Channel D
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: 667 MHz (1.5 ns)
	Manufacturer: 8551
	Serial Number: 02028522
	Asset Tag: Not Specified
	Part Number: 72T512920EFA3SC

Handle 0x0045, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0038
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: FB-DIMM
	Set: 6
	Locator: ONBOARD DIMM_D2
	Bank Locator: Channel D
	Type: DDR2 FB-DIMM
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: MemUndefined
	Serial Number: MemUndefined
	Asset Tag: Not Specified
	Part Number: MemUndefined
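
The listing above shows four populated slots of 4096 MB each
(4 x 4096 MB = 16384 MB = 16 GB) and four empty ones. A minimal sketch for
extracting that summary from 'dmidecode -t memory' output follows; it
assumes blank-line-separated records in the layout shown, and the function
names and the shortened SAMPLE text are made up for illustration.

```python
import re


def parse_memory_devices(text):
    """Return one dict of 'Key: Value' fields per 'Memory Device' record."""
    devices = []
    for block in text.split("\n\n"):
        lines = [line.strip() for line in block.strip().splitlines()]
        if "Memory Device" not in lines:
            continue  # skips the header and the 'Physical Memory Array' record
        devices.append(dict(line.split(": ", 1) for line in lines if ": " in line))
    return devices


def installed_mb(devices):
    """Sum the sizes of populated slots; empty slots report no 'N MB' size."""
    total = 0
    for device in devices:
        match = re.fullmatch(r"(\d+) MB", device.get("Size", ""))
        if match:
            total += int(match.group(1))
    return total


# Two records shortened from the output above: one populated, one empty slot.
SAMPLE = """Handle 0x003A, DMI type 17, 27 bytes
Memory Device
\tSize: 4096 MB
\tLocator: ONBOARD DIMM_A1
\tSpeed: 667 MHz (1.5 ns)

Handle 0x003C, DMI type 17, 27 bytes
Memory Device
\tSize: No Module Installed
\tLocator: ONBOARD DIMM_A2
\tSpeed: Unknown
"""

if __name__ == "__main__":
    devices = parse_memory_devices(SAMPLE)
    for device in devices:
        print(device["Locator"], "->", device["Size"])
    print("installed:", installed_mb(devices), "MB")
```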

Greetings,
 jonas

