
Re: Raid 5



On 07/03/13 05:40 PM, Dan Ritter wrote:
> On Thu, Mar 07, 2013 at 12:05:02AM -0500, Gary Dale wrote:
>> The issue is the probability of failure of a second drive when the
>> array is vulnerable after the failure of one drive. Given that all
>> modern drives have SMART capability, you can normally detect a
>> faulty drive long before it fails. The chances of a second failure
>> during the rebuild are small.
> Ah, that's the problem. The odds of a second failure during the
> rebuild are much, much higher than you would naively expect.
>
> Let's say your disks have an unrecoverable read error rate of 1 in
> 10^14 bits. (This is a plausible figure; look to your manufacturer's
> support site for specifics.) That's roughly one error per 12
> terabytes read.
>
> If you have four 2TB disks, and they are in a RAID10, recovering
> one disk means reading 2TB and writing 2TB. About 1 in 6 chance
> of something going wrong during that process.
>
> If you have the same four 2TB disks in a RAID5, you need to read
> 6TB of information and write 2TB. That's a 50% chance of
> something going wrong.
Your calculation is naive. The figures you want are the drives' MTBF (mean time between failures) and their expected service life, both of which are huge compared to the time it takes to rebuild the array. Moreover, sudden catastrophic failures are far less common than gradual deterioration, so you usually get some warning of failure - make sure your SMART daemon is working.
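To put a rough number on that (the MTBF and rebuild time below are
assumptions - check your drive's datasheet and your own rebuild times):

import math

# Chance that one of the surviving drives fails outright during the
# rebuild window, modelling failures as a constant-rate process.
MTBF_HOURS = 1e6          # common datasheet figure; yours may differ
REBUILD_HOURS = 10.0      # depends on array size and load
SURVIVORS = 3             # four-disk array with one drive already out

p_one = 1 - math.exp(-REBUILD_HOURS / MTBF_HOURS)
p_any = 1 - (1 - p_one) ** SURVIVORS

print("P(second whole-drive failure during rebuild) = %.5f" % p_any)
# about 0.00003, i.e. roughly 1 in 33,000

That's the whole-drive failure mode; it's a different animal from the
per-bit read errors in the calculation above.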

If you suddenly find two disks going bad, shut down your computer and copy both onto new drives, then restart the RAID array using the new drives. Unless you have the same blocks bad on both disks, you should be OK.
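In practice, GNU ddrescue is the usual tool for that copy, since it retries
and maps out the bad areas. Just to illustrate the idea, here's a bare-bones
sketch (not something to rely on for real recovery; the device paths are
placeholders):

import os

# Copy a failing disk onto a replacement, zero-filling unreadable
# chunks instead of aborting. Placeholders: /dev/sdX is the failing
# disk, /dev/sdY the new one. Run as root, and double-check the paths.
SRC, DST = "/dev/sdX", "/dev/sdY"
CHUNK = 1024 * 1024

src = os.open(SRC, os.O_RDONLY)
dst = os.open(DST, os.O_WRONLY)
size = os.lseek(src, 0, os.SEEK_END)
offset = bad = 0

while offset < size:
    want = min(CHUNK, size - offset)
    os.lseek(src, offset, os.SEEK_SET)
    try:
        data = os.read(src, want)
    except OSError:
        data = b"\0" * want          # unreadable region: fill with zeros
        bad += 1
    if not data:                     # defensive: unexpected EOF
        break
    os.lseek(dst, offset, os.SEEK_SET)
    os.write(dst, data)
    offset += len(data)

print("unreadable chunks zero-filled:", bad)
os.close(src)
os.close(dst)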

The time to rebuild depends on how busy your computer is. If possible, I recommend taking the machine offline while rebuilding so that the chances of a second failure are minimized. If this is a high-use production machine, you probably want RAID 6. If it's a workstation, boot to a command prompt only so that the machine isn't busy operating a GUI.
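If you do have to leave it online, you can at least keep an eye on the
rebuild and raise md's rebuild speed floor so other I/O doesn't starve it.
Roughly like this (run as root; the 50000 KB/s figure is just an example):

# Show md rebuild/resync progress and raise the kernel's rebuild speed
# floor (KB/s). The default floor is usually 1000 KB/s; raising it lets
# the rebuild take more bandwidth even while other I/O is going on.

def show_progress():
    with open("/proc/mdstat") as f:
        for line in f:
            if "recovery" in line or "resync" in line:
                print(line.rstrip())

def raise_speed_floor(kb_per_sec=50000):
    with open("/proc/sys/dev/raid/speed_limit_min", "w") as f:
        f.write(str(kb_per_sec))

raise_speed_floor()
show_progress()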


>> The larger problem is having a defective array that goes undetected.
>> That's why mdadm is normally configured to check the array for
>> errors periodically.
> This is, indeed, a large problem with a good specified solution.
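For reference, that periodic check is just a write to md's sysfs interface,
and you can read the mismatch count back afterwards. Debian's mdadm package
normally schedules it from cron (checkarray), but done by hand it looks
roughly like this - md0 is an example device name:

import time

# Trigger an md consistency check and report the mismatch count.
# Run as root; adjust "md0" to your array.
MD = "/sys/block/md0/md"

def start_check():
    with open(MD + "/sync_action", "w") as f:
        f.write("check")

def check_running():
    with open(MD + "/sync_action") as f:
        return f.read().strip() != "idle"

def mismatches():
    with open(MD + "/mismatch_cnt") as f:
        return int(f.read())

start_check()
while check_running():
    time.sleep(60)
print("mismatch_cnt:", mismatches())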

>> RAID 6 only takes one more drive and removes even these small
>> failure windows. RAID 1 simply uses too much hardware for the slight
>> increase in reliability it gives relative to RAID 5. If you're super
>> concerned about reliability, go to RAID 6.
> On 4 disks, RAID10 is usually better than RAID6. You get the
> speed advantage of not having to calculate parity, and the
> same capacity.
To a point, but RAID 6 can survive any two-disk failure without data loss, while a four-disk RAID10 loses half its blocks (in practice, the whole array) if the two failed disks happen to be the same mirror pair. It depends on which two disks fail.
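You can count it directly for the four-disk case. A small sketch, assuming
plain striped mirrors with disks 0+1 and 2+3 paired (md's raid10 layouts can
differ):

from itertools import combinations

# Which two-disk failures does a four-disk array survive?
MIRROR_PAIRS = [{0, 1}, {2, 3}]      # assumed pairing for RAID10

def raid10_survives(failed):
    # Data is lost only if both members of some mirror pair are gone.
    return not any(pair <= failed for pair in MIRROR_PAIRS)

def raid6_survives(failed):
    # Double parity tolerates any two failures.
    return len(failed) <= 2

combos = list(combinations(range(4), 2))
r10 = sum(raid10_survives(set(c)) for c in combos)
r6 = sum(raid6_survives(set(c)) for c in combos)
print("two-disk failures survived: RAID10 %d/%d, RAID6 %d/%d"
      % (r10, len(combos), r6, len(combos)))    # RAID10 4/6, RAID6 6/6

So with four disks, a random pair of failures takes out the RAID10 about one
time in three, and never takes out the RAID6.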


>> The other thing to recognize is that RAID is not backup. Most data
>> loss takes place through human error, not hardware failure. A good
>> backup system is your ultimate guard against data loss. RAID is
>> simply there to keep the hardware running between backups.
> Well, uptime and/or performance and/or convenience. But you need
> to know what trade-offs you are making.
That's a much larger discussion.  :)

