
Re: Server hardware advice.



On 8/8/19 7:22 AM, Dan Ritter wrote:
> To summarize: if you're running ZFS, it can protect you from
> lots of sources of data corruption. It can't protect you from
> RAM errors without ECC, so you should opt for ECC if integrity
> is your goal.
>
> None of the other filesystems protect you against RAM errors
> either, so this is not a special requirement of ZFS.

+1


The same goes for anything that uses main memory, which is pretty much everything I use computers for.


Bad data in memory is bad enough, but bad data written to disk is the gift that keeps on giving -- replication overwriting good data, snapshot and backup rotation overwriting good data, archive destruction destroying good data, etc. The longer it takes to figure out that the data is bad, the less likely you are to recover it.


For me, the key points in favor of ECC are:

1. Wikipedia gives DRAM bit error rates (BER) from 10^-10 to 10^-17 errors per bit per hour [1]. At those rates, averaging one error per year takes only about 143 kB of DRAM at the pessimistic end, and about 1.43 TB at the optimistic end, under some test conditions (a back-of-the-envelope check follows below).

2. In the wild, not all chips, modules, sockets, capacitors, motherboards, etc., are healthy or compatible. Real BERs can be much higher.

3. The BER of DRAM tends to increase as the transistors, capacitors, lines, etc., get smaller and faster [2]. Given Moore's Law, manufacturers must be hard pressed just to maintain the BER with each new generation.

4. Moore's Law again: the amount of DRAM in devices has been increasing exponentially, adding ever more bits that can go bad.


So, it is just a matter of time before the expected error rate per machine becomes significant. One article I read said desktops and laptops already crossed that line at 8 to 16 GB. COTS servers can carry one or two orders of magnitude more memory.
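
To put numbers on point 1, here is a quick back-of-the-envelope sketch in Python. The BER range is from [1]; the unit conversions (8,766 hours per year, 8 bits per byte) and the assumption that errors are independent and uniform per bit are mine, not from the thread.

HOURS_PER_YEAR = 365.25 * 24  # ~8766

def errors_per_year(dram_bytes, ber):
    """Expected bit errors per year, assuming independent errors,
    uniform per bit, at `ber` errors/bit/hour."""
    return dram_bytes * 8 * ber * HOURS_PER_YEAR

def bytes_per_annual_error(ber):
    """DRAM size that averages one bit error per year at `ber`."""
    return 1.0 / (ber * 8 * HOURS_PER_YEAR)

for ber in (1e-10, 1e-17):
    print(f"BER {ber:.0e}: one error/year per "
          f"{bytes_per_annual_error(ber):.2e} bytes")
# BER 1e-10: one error/year per ~1.43e+05 bytes (~143 kB)
# BER 1e-17: one error/year per ~1.43e+12 bytes (~1.43 TB)

gib16 = 16 * 2**30  # a typical desktop today
print(f"16 GiB: {errors_per_year(gib16, 1e-10):.0f} errors/year worst case, "
      f"{errors_per_year(gib16, 1e-17):.3f} best case")
# ~120,000 errors/year at the pessimistic end; ~0.012/year
# (one every ~80 years) at the optimistic end.

The seven-orders-of-magnitude spread is the point: where a given machine actually falls depends on the factors in points 2 and 3, which is why real-world measurements like [2] matter more than the datasheet range.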


David


[1] https://en.wikipedia.org/wiki/Dynamic_random-access_memory

[2] https://danluu.com/why-ecc/

