
Re: EDAC errors: false positives or broken RAM? (i5000)



Hey,

On 16.08.2011 19:31, Henrique de Moraes Holschuh wrote:
> On Tue, 16 Aug 2011, Jonas Meurer wrote:
>> Now, the messages have been spamming my log and console for more than three
>> weeks already. On some days I get more than 36000 errors a day.
>>
>> It's notable that every DRAM bank from 0 to 7 is affected.
>>
>> Now I wonder, whether these are false positives (searching for the
>> errors in the web revealed that these are quite common), or whether my
>> RAM might be damaged.
> 
> Memory errors _ARE_ a lot more common than people would like them to be.
> False positives are not common at all, but yes, a kernel bug could misdetect
> errors, and you _are_ running an ancient kernel.

Thanks for your reply. I know that I'm running an ancient kernel.
Actually, I plan to upgrade the system to Debian Squeeze in one or two
months. I'll see whether that helps.

I thought that in case of memory errors, other strange things would happen
as well: processes would segfault and so on. If the system really has
memory errors, and the EDAC error messages refer to these errors, then
the memory errors have already existed for more than a year. But nothing
*strange* happened in this year.
Maybe the memory errors don't lead to data corruption? I don't know much
about the internals of RAM modules. My naive opinion until now was that
RAM errors imply big problems. But maybe errors with low severity do
exist as well, just like the overheating problems you mention below?
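
One way to check this would be to compare corrected and uncorrected
error counts. The following is only a minimal sketch, assuming the kernel
exposes the EDAC counters ce_count and ue_count under
/sys/devices/system/edac/mc/mcN/ (I have not verified this on the old
kernel here). The function name read_edac_counts and its root parameter
are made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory.

    The default path is an assumption about the EDAC sysfs layout;
    a counter that cannot be read is reported as None.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC not loaded, or sysfs laid out differently on this kernel.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found")
    for name, c in result.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

If ue_count stayed at 0 while ce_count kept growing, that would be
consistent with ECC correcting the errors, and with nothing strange
having happened so far.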

>> Unfortunately, running memtest86+ is not an option, as the server in
>> question is a production server, and I don't have a second server for
>> redundancy.
> 
> You better rethink that.  Your only server is spewing memory errors right
> and left, you don't have a second server, and you don't want to schedule
> emergency repair?

First, as mentioned earlier, I'm not convinced that the error messages
are caused by real memory errors. Searching the web for EDAC errors turned
up loads of threads and blog posts about EDAC error messages being false
positives. And the fact that all banks are affected was mentioned
several times as an indicator of false positives.

Now I see that overheating might affect all banks as well.

I do have backups, and am able to set up the same server again within
hours. Unfortunately, this is the only option for financial reasons.

>> /proc/sysrq-trigger' do stop the logging flood to console. Did I miss
>> anything, or is it simply impossible to stop console logging for this
>> kind of kernel error messages. That would be very unfortunate.
> 
> dmesg -c can do it.

As you already wrote yourself, 'dmesg -n' is what you mean. But
unfortunately even 'dmesg -n1' doesn't stop the EDAC error messages from
being logged to my ssh session.

>> Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: FATAL ERRORS Found!!!
>> 1st FATAL Err Reg= 0x4
>> Aug 16 13:08:20 nibbler kernel: EDAC i5000 MC0: >Tmid Thermal event with
>> intelligent throttling disabled
> 
> Well, it is complaining of memory module overheat, and that your BIOS has
> not programmed the chipset to slow down memory modules when they overheat.
> I've never seen any of the i5000-based servers around here doing that.

Maybe a BIOS update might help? I'll check that when I'm back at home.

Here's more information about the system:

# lspci -v -s 00:00.0
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1)
	Subsystem: Intel Corporation Intel S5000PSLSATA Server Board
	Flags: bus master, fast devsel, latency 0
	Capabilities: [50] Power Management version 2
	Capabilities: [58] Message Signalled Interrupts: Mask- 64bit- Queue=0/1 Enable-
	Capabilities: [6c] Express Root Port (Slot-), MSI 00
	Capabilities: [100] Advanced Error Reporting <?>

>> Handle 0x003A, DMI type 17, 27 bytes
>> Memory Device
>> 	Array Handle: 0x0038
>> 	Error Information Handle: Not Provided
>> 	Total Width: 72 bits
>> 	Data Width: 64 bits
>> 	Size: 4096 MB
>> 	Form Factor: FB-DIMM
>> 	Set: 1
>> 	Locator: ONBOARD DIMM_A1
>> 	Bank Locator: Channel A
>> 	Type: DDR2 FB-DIMM
>> 	Type Detail: Synchronous
>> 	Speed: 667 MHz (1.5 ns)
>> 	Manufacturer: 8551
>> 	Serial Number: 02028121
>> 	Asset Tag: Not Specified
>> 	Part Number: 72T512920EFA3SC
> 
> FB-DIMMS are nasty heat sources.  If they're not being cooled properly, they
> WILL get damaged.

I don't have control over the hardware. It's provided by the datacenter.

Greetings,
 jonas

