EDAC errors: false positives or broken RAM? (i5000)
Hey,
I've been getting tons of EDAC error messages from the kernel lately on
an i5000 server with 16GB RAM (4 x 4GB modules). The server has been
running for about three years.
The system is Debian Lenny with a 2.6.26 kernel, self-compiled from
linux-source-2.6.26 2.6.26-26lenny3.
This isn't the first time these EDAC error messages have appeared.
In fact, over the last three years I have seen them every now and
then. Sometimes only a few errors were logged, sometimes my logs were
spammed with them for several days, but then it stopped again.
This time, the messages have been spamming my log and console for more
than three weeks already. On some days I get more than 36000 errors a day.
It's notable that every DRAM-Bank from 0 to 7 is affected.
Now I wonder whether these are false positives (a web search for the
errors suggests that these are quite common), or whether my RAM might
be damaged.
Unfortunately, running memtest86+ is not an option, as the server in
question is a production server, and I don't have a second server for
redundancy.
Additionally, a slightly related question: How do I turn off the logging
of these messages to the console? It's impossible to work in an SSH
session when the console is spammed with these logs. Neither setting
kernel.printk, nor 'setterm -msg 0', 'dmesg -n1' or 'echo 1 >
/proc/sysrq-trigger' stops the logging flood to the console. Did I miss
anything, or is it simply impossible to stop console logging for this
kind of kernel error message? That would be very unfortunate.
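For reference, this is roughly how I check whether a kernel.printk change actually took effect. It is only a minimal sketch, assuming the usual four-value layout of /proc/sys/kernel/printk (console_loglevel, default_message_loglevel, minimum_console_loglevel, default_console_loglevel); the helper name parse_printk is my own.

```python
from pathlib import Path

# Field order as documented for /proc/sys/kernel/printk.
FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four whitespace-separated integers to their field names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/printk")
    if path.exists():  # only present on Linux
        for name, value in parse_printk(path.read_text()).items():
            print(f"{name} = {value}")
```

On my machine the first value does change after 'dmesg -n1', yet the messages keep showing up in the session.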
I already considered recompiling the kernel without the EDAC i5000
driver in order to stop this annoyance, but I would prefer to fix the
cause rather than fight the symptoms.
Here's an example error message:
Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!! 1st FATAL Err Reg= 0x4
Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with intelligent throttling disabled
Aug 16 13:08:20 nibbler kernel: EDAC MC0: UE row 1, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=6 RDWR=Read RAS=14214 CAS=0 FATAL Err=0x4)
Aug 16 13:08:22 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!! 1st FATAL Err Reg= 0x4
Aug 16 13:08:22 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with intelligent throttling disabled
Aug 16 13:08:22 nibbler kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=20 CAS=0 FATAL Err=0x4)
Aug 16 13:08:24 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!! 1st FATAL Err Reg= 0x4
Aug 16 13:08:24 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with intelligent throttling disabled
Aug 16 13:08:24 nibbler kernel: EDAC MC0: UE row 1, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=1 RDWR=Read RAS=3268 CAS=0 FATAL Err=0x4)
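In case someone wants to reproduce the per-day and per-bank numbers I mentioned: a rough sketch of how such a tally can be done. It assumes each log entry sits on a single line in the log file (my mail client wraps them), that the file uses the syslog timestamp format shown above, and that the UE lines carry a DRAM-Bank= field. The function name tally and the log path passed on the command line are my own choices, not anything EDAC provides.

```python
import re
import sys
from collections import Counter

# Syslog timestamp prefix ("Aug 16") and the DRAM-Bank field of the UE line.
DAY_RE = re.compile(r"^(\w{3}\s+\d{1,2})\s")
BANK_RE = re.compile(r"DRAM-Bank=(\d+)")


def tally(lines):
    """Count EDAC UE lines per day and per DRAM bank."""
    per_day, per_bank = Counter(), Counter()
    for line in lines:
        if "EDAC" not in line:
            continue
        day = DAY_RE.match(line)
        bank = BANK_RE.search(line)
        if day and bank:
            # Normalise "Aug  6" and "Aug 6" to the same key.
            per_day[" ".join(day.group(1).split())] += 1
            per_bank[int(bank.group(1))] += 1
    return per_day, per_bank


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 tally.py /var/log/kern.log  (path is an assumption)
    with open(sys.argv[1], errors="replace") as fh:
        days, banks = tally(fh)
    for day, count in days.items():
        print(f"{day}: {count} UE lines")
    for bank in sorted(banks):
        print(f"DRAM-Bank {bank}: {banks[bank]} UE lines")
```

Only the "UE row" lines are counted, since the two "EDAC i5000 MC0" lines logged alongside each one carry no bank information.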
This is what the EDAC module logged at my last reboot:
Jun 27 00:10:29 nibbler kernel: EDAC MC: Ver: 2.1.0 Jun 23 2011
Jun 27 00:10:29 nibbler kernel: EDAC MC0: Giving out device to 'i5000_edac.c' 'I5000': DEV 0000:00:10.0
Jun 27 00:10:29 nibbler kernel: EDAC PCI0: Giving out device to module 'i5000_edac' controller 'EDAC PCI controller': DEV '0000:00:10.0' (POLLED)
And last but not least the output of 'dmidecode -t memory':
# dmidecode 2.9
SMBIOS 2.5 present.
Handle 0x0038, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 32 GB
Error Information Handle: Not Provided
Number Of Devices: 8
Handle 0x003A, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: FB-DIMM
Set: 1
Locator: ONBOARD DIMM_A1
Bank Locator: Channel A
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: 667 MHz (1.5 ns)
Manufacturer: 8551
Serial Number: 02028121
Asset Tag: Not Specified
Part Number: 72T512920EFA3SC
Handle 0x003C, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: FB-DIMM
Set: 2
Locator: ONBOARD DIMM_A2
Bank Locator: Channel A
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: Unknown
Manufacturer: MemUndefined
Serial Number: MemUndefined
Asset Tag: Not Specified
Part Number: MemUndefined
Handle 0x003D, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: FB-DIMM
Set: 1
Locator: ONBOARD DIMM_B1
Bank Locator: Channel B
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: 667 MHz (1.5 ns)
Manufacturer: 8551
Serial Number: 02027215
Asset Tag: Not Specified
Part Number: 72T512920EFA3SC
Handle 0x003F, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: FB-DIMM
Set: 2
Locator: ONBOARD DIMM_B2
Bank Locator: Channel B
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: Unknown
Manufacturer: MemUndefined
Serial Number: MemUndefined
Asset Tag: Not Specified
Part Number: MemUndefined
Handle 0x0040, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: FB-DIMM
Set: 5
Locator: ONBOARD DIMM_C1
Bank Locator: Channel C
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: 667 MHz (1.5 ns)
Manufacturer: 8551
Serial Number: 02027112
Asset Tag: Not Specified
Part Number: 72T512920EFA3SC
Handle 0x0042, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: FB-DIMM
Set: 6
Locator: ONBOARD DIMM_C2
Bank Locator: Channel C
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: Unknown
Manufacturer: MemUndefined
Serial Number: MemUndefined
Asset Tag: Not Specified
Part Number: MemUndefined
Handle 0x0043, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: FB-DIMM
Set: 5
Locator: ONBOARD DIMM_D1
Bank Locator: Channel D
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: 667 MHz (1.5 ns)
Manufacturer: 8551
Serial Number: 02028522
Asset Tag: Not Specified
Part Number: 72T512920EFA3SC
Handle 0x0045, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: FB-DIMM
Set: 6
Locator: ONBOARD DIMM_D2
Bank Locator: Channel D
Type: DDR2 FB-DIMM
Type Detail: Synchronous
Speed: Unknown
Manufacturer: MemUndefined
Serial Number: MemUndefined
Asset Tag: Not Specified
Part Number: MemUndefined
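Since the dump is long, here is a small sketch that boils it down to the populated slots. It assumes the field layout shown above (a "Size:" line before the "Locator:" line in each record, sizes reported in MB); the function name installed_modules is just something I made up.

```python
import sys


def installed_modules(text):
    """Return (locator, size_mb) for every populated Memory Device record."""
    modules = []
    size = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            size = None  # a new record starts; forget the previous size
        elif line.startswith("Size:"):
            value = line.split(":", 1)[1].strip()
            # Populated slots report e.g. "4096 MB"; empty ones do not.
            size = int(value.split()[0]) if value.endswith(" MB") else None
        elif line.startswith("Locator:") and size is not None:
            modules.append((line.split(":", 1)[1].strip(), size))
    return modules


if __name__ == "__main__" and not sys.stdin.isatty():
    # Example: dmidecode -t memory | python3 modules.py
    found = installed_modules(sys.stdin.read())
    for locator, size_mb in found:
        print(f"{locator}: {size_mb} MB")
    print(f"total: {sum(size for _, size in found)} MB")
```

Fed with the output above, this should list DIMM_A1, DIMM_B1, DIMM_C1 and DIMM_D1 at 4096 MB each, i.e. the 16GB I mentioned.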
Greetings,
jonas