On Thu, Mar 07, 2013 at 12:05:02AM -0500, Gary Dale wrote:
> The issue is the probability of failure of a second drive when the
> array is vulnerable after the failure of one drive. Given that all
> modern drives have SMART capability, you can normally detect a
> faulty drive long before it fails. The chances of a second failure
> during the rebuild are small.
Ah, that's the problem. The odds of a second failure during the
rebuild are much, much higher than you would naively expect.
Let's say your disks have an unrecoverable read error rate of 1 in
10^14 bits. (This is a plausible figure; look to your manufacturer's
support site for specifics.) That works out to roughly one error per
12.5 terabytes read.
If you have four 2TB disks, and they are in a RAID10, recovering
one disk means reading 2TB from the surviving mirror and writing
2TB to the replacement. That's about a 1 in 6 chance of hitting an
unrecoverable error during the rebuild.
If you have the same four 2TB disks in a RAID5, you need to read
6TB from the three surviving disks and write 2TB. That's roughly a
50% chance of something going wrong.
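
If you want to play with the numbers yourself, here's a
back-of-the-envelope sketch in Python. The 1-in-10^14 rate and
decimal terabytes are the assumptions from above; the 1 in 6 and
50% figures are the simple linear estimate (bytes read divided by
12.5TB), while the exact 1-(1-p)^bits comes out a bit lower on the
larger read, closer to 38%, so treat 50% as a pessimistic round
number.

    import math

    URE_RATE = 1e-14   # unrecoverable errors per bit read (assumed, per above)

    def p_ure(tb_read):
        """Chance of at least one unrecoverable read error while
        reading tb_read decimal terabytes (1 TB = 8e12 bits)."""
        bits = tb_read * 8e12
        # 1 - (1 - p)^bits, computed stably for tiny p
        return -math.expm1(bits * math.log1p(-URE_RATE))

    print("RAID10 rebuild, 2TB read: about %.0f%%" % (100 * p_ure(2)))
    print("RAID5 rebuild, 6TB read: about %.0f%%" % (100 * p_ure(6)))
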
> The larger problem is having a defective array that goes undetected.
> That's why mdadm is normally configured to check the array for
> errors periodically.
This is, indeed, a large problem, and one with a well-established
solution.
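
On Debian, for instance, the mdadm package ships a monthly cron job
that runs checkarray; under the hood it just pokes the kernel's
sysfs interface. A rough sketch of doing the same by hand (array
name /dev/md0 assumed, root required):

    # The shell equivalent is: echo check > /sys/block/md0/md/sync_action
    from pathlib import Path

    md = Path("/sys/block/md0/md")
    (md / "sync_action").write_text("check\n")   # start a background scrub

    # sync_action reads back "check" while the scrub runs; once it
    # finishes, mismatch_cnt holds the count of sectors that disagreed.
    print("state:", (md / "sync_action").read_text().strip())
    print("mismatches:", (md / "mismatch_cnt").read_text().strip())
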
> RAID 6 only takes one more drive and removes even these small
> failure windows. RAID 1 simply uses too much hardware for the slight
> increase in reliability it gives relative to RAID 5. If you're super
> concerned about reliability, go to RAID 6.
On 4 disks, RAID10 is usually better than RAID6. You get the
speed advantage of not having to calculate parity, and the same
usable capacity.
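
To make the capacity point concrete, a toy sketch (assuming the
usual layouts: RAID10 as striped mirrors, RAID6 spending two disks'
worth of space on parity):

    def usable_tb(level, n_disks, disk_tb):
        """Usable space for the two layouts discussed above."""
        if level == "raid10":
            return n_disks * disk_tb / 2      # half the raw space
        if level == "raid6":
            return (n_disks - 2) * disk_tb    # minus two disks of parity
        raise ValueError(level)

    for level in ("raid10", "raid6"):
        print("%s: %.0f TB usable on four 2TB disks"
              % (level, usable_tb(level, 4, 2)))
    # Both print 4 TB. The trade-off: RAID10 rebuilds by copying one
    # mirror (a 2TB read), while RAID6 reads the rest of the array,
    # but RAID6 survives any two disk failures.
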