On Thu, Mar 07, 2013 at 12:05:02AM -0500, Gary Dale wrote:
> The issue is the probability of failure of a second drive when the
> array is vulnerable after the failure of one drive. Given that all
> modern drives have SMART capability, you can normally detect a
> faulty drive long before it fails. The chances of a second failure
> during the rebuild are small.
Ah, that's the problem. The odds of a second failure during the
rebuild are much, much higher than you would naively expect.
Let's say your disks have an unrecoverable read error rate of 1 in
10^14 bits. (This is a plausible figure; look to your manufacturer's
support site for specifics.) That works out to roughly one error per
12.5 terabytes read.
If you have four 2TB disks, and they are in a RAID10, recovering
one disk means reading 2TB from the surviving mirror and writing
2TB to the replacement. That's about a 1 in 6 chance of hitting an
unrecoverable error during the rebuild.
If you have the same four 2TB disks in a RAID5, you need to read
6TB from the three surviving disks and write 2TB. That's roughly a
50% chance of something going wrong.
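
If you want to play with the numbers yourself, here's a
back-of-the-envelope sketch in Python. The 1-in-10^14 rate and
decimal terabytes are the assumptions from above; the 1 in 6 and
50% figures are the simple linear estimate (bytes read divided by
12.5TB), while the exact 1-(1-p)^bits comes out a bit lower on the
larger read, closer to 38%, so treat 50% as a pessimistic round
number.

    import math

    URE_RATE = 1e-14   # unrecoverable errors per bit read (assumed, per above)

    def p_ure(tb_read):
        """Chance of at least one unrecoverable read error while
        reading tb_read decimal terabytes (1 TB = 8e12 bits)."""
        bits = tb_read * 8e12
        # 1 - (1 - p)^bits, computed stably for tiny p
        return -math.expm1(bits * math.log1p(-URE_RATE))

    print("RAID10 rebuild, 2TB read: about %.0f%%" % (100 * p_ure(2)))
    print("RAID5 rebuild, 6TB read: about %.0f%%" % (100 * p_ure(6)))
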
> The larger problem is having a defective array that goes undetected.
> That's why mdadm is normally configured to check the array for
> errors periodically.
This is, indeed, a large problem, and one with a well-established
solution.
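
On Debian, for instance, the mdadm package ships a monthly cron job
that runs checkarray; under the hood it just pokes the kernel's
sysfs interface. A rough sketch of doing the same by hand (array
name /dev/md0 assumed, root required):

    # The shell equivalent is: echo check > /sys/block/md0/md/sync_action
    from pathlib import Path

    md = Path("/sys/block/md0/md")
    (md / "sync_action").write_text("check\n")   # start a background scrub

    # sync_action reads back "check" while the scrub runs; once it
    # finishes, mismatch_cnt holds the count of sectors that disagreed.
    print("state:", (md / "sync_action").read_text().strip())
    print("mismatches:", (md / "mismatch_cnt").read_text().strip())
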
> RAID 6 only takes one more drive and removes even these small
> failure windows. RAID 1 simply uses too much hardware for the slight
> increase in reliability it gives relative to RAID 5. If you're super
> concerned about reliability, go to RAID 6.
On 4 disks, RAID10 is usually better than RAID6. You get the
speed advantage of not having to calculate parity, and the same
usable capacity.
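
To make the capacity point concrete, a toy sketch (assuming the
usual layouts: RAID10 as striped mirrors, RAID6 spending two disks'
worth of space on parity):

    def usable_tb(level, n_disks, disk_tb):
        """Usable space for the two layouts discussed above."""
        if level == "raid10":
            return n_disks * disk_tb / 2      # half the raw space
        if level == "raid6":
            return (n_disks - 2) * disk_tb    # minus two disks of parity
        raise ValueError(level)

    for level in ("raid10", "raid6"):
        print("%s: %.0f TB usable on four 2TB disks"
              % (level, usable_tb(level, 4, 2)))
    # Both print 4 TB. The trade-off: RAID10 rebuilds by copying one
    # mirror (a 2TB read), while RAID6 reads the rest of the array,
    # but RAID6 survives any two disk failures.
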