Re: backup archive format saved to disk

Douglas Tutty wrote:
Mike,
Without expending any mathematical energy, could you recompute your two
probabilities based on a set of three disks instead of 2?  I'm guessing
that the probability of one disk failing goes up but the probability of
all three failing drops substantially (the famous triple-redundancy
theory).

It's very easy. Suppose we have events E1, E2, E3, ... En. Suppose that
they occur independently. Suppose that Pr(Ek) = Pk. Then the probability
that at least one of the Ek occurs is

1 - (1-P1)(1-P2)(1-P3)...(1-Pn)

If the probabilities are all the same, Pk = p for all k, then it is

1 - (1-p)^n

In the case we just discussed where p = 0.01 and using n = 3 per your
request, we get

Pr(one or more of three discs fails) = 1 - 0.99^3 = 0.029701
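
For anyone who wants to rerun this with other numbers, here is a minimal Python sketch of the two formulas above. The function names are mine, and everything assumes the failures are independent. The "all of them fail" figure is the plain product of the probabilities, which for equal p is p^n; it is not worked out above, but it follows from the same independence assumption.

```python
import math

def prob_at_least_one(probs):
    """Probability that at least one of several independent events occurs."""
    # Multiply the chances that each event does NOT occur, then complement.
    return 1.0 - math.prod(1.0 - p for p in probs)

def prob_all(probs):
    """Probability that every one of several independent events occurs."""
    return math.prod(probs)

p = 0.01  # per-disc failure probability used in this thread
for n in (2, 3):
    discs = [p] * n
    print(n, "discs:",
          "at least one fails =", prob_at_least_one(discs),
          "all fail =", prob_all(discs))
```

With p = 0.01 and three discs, the first figure reproduces the 0.029701 above, and the chance of all three failing comes out at about one in a million.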

I'm assuming that a partially failed disk will return good data (because
of the FEC) and that an error notice ends up in syslog (do you know the
severity?).

Depends on how "partially" it fails. It may be "partially" failed so
that some sectors are readable, and others are not.

I'm not a Linux expert, so I don't know the answer to your question.

How does a raid1 array handle a partially failing disk?  Does it just
take the good data and carry on until the drive completely fails or does
mdadm also get involved in issuing a warning of a failing drive?

Umm,

RAID 0    striping, results in speedy access only, not true RAID
RAID 1    writes redundant copies of data to two (or more) discs
RAID 3    writes redundant data to three (or more) discs and reserves
          one disc as "parity"; requires at least four discs

I've implemented RAID 1 in a fully redundant system (dual
discs, dual controllers, each controller able to control each
disc, both controllers connected to independent computers with
separate power supplies). Here is how I did it. Writes went
to both discs, and did not return to the requesting app until
both writes were complete, unless a no-wait I/O was requested,
in which case I provided notification when the write completed.
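
To make the write rule concrete, here is a rough Python sketch. It is not the code from that system: the class name, the dict-per-disc model, and the on_complete callback are all made up for illustration. It only shows the shape of "return when both copies are written, or return at once and notify later".

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class MirroredStore:
    """Toy two-disc mirror; each 'disc' is just a dict of block -> data."""

    def __init__(self):
        self.discs = [{}, {}]
        self.pool = ThreadPoolExecutor(max_workers=2)

    def _write_one(self, disc, block, data):
        disc[block] = data

    def write(self, block, data, no_wait=False, on_complete=None):
        # Start one write per disc.
        futures = [self.pool.submit(self._write_one, disc, block, data)
                   for disc in self.discs]
        if not no_wait:
            # Normal case: do not return until both copies are written.
            for f in futures:
                f.result()  # blocks; re-raises any write error
            return
        # No-wait case: return immediately, notify after the last copy lands.
        pending = [len(futures)]
        lock = threading.Lock()

        def one_done(_future):
            with lock:
                pending[0] -= 1
                finished = pending[0] == 0
            if finished and on_complete is not None:
                on_complete(block)

        for f in futures:
            f.add_done_callback(one_done)
```

A caller would use write(block, data) for the ordinary blocking case, and write(block, data, no_wait=True, on_complete=some_function) when it wants to carry on and be told later.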

Reads simply used the disc on the same side using the controller
on the same side as the requesting CPU, unless one disc failed.
If I got read error reports, then I used a leaky bucket. If the
leaky bucket overflowed, then I placed the suspect disc out of
service, asserted a frame alarm, issued an Information or Problem
Report (IPR) and continued with just one disc. When the disc
got replaced, I'd format it and start equalizing. I started
at the bottom of the disc and kept a high-water mark. Any writes
below the high-water mark were done redundantly, while writes
above the high-water mark were done simplex. When the high-water
mark reached the top of the disc, the disc was put back into
service, and the alarm was abated.
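
The two mechanisms in that paragraph can be sketched in a few lines of Python. This is an illustration only: the capacity and leak rate of the bucket, the class names, and the list-of-blocks model of a disc are assumptions of mine, not details of the original system.

```python
class LeakyBucket:
    """Counts read errors; the level drains over time and trips on overflow."""

    def __init__(self, capacity, leak_per_second):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.level = 0.0
        self.last = 0.0

    def add_error(self, now):
        """Record one error at time `now` (seconds); return True on overflow."""
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last = now
        self.level += 1.0
        return self.level > self.capacity


class Mirror:
    """Toy mirror being rebuilt; lists of blocks stand in for discs."""

    def __init__(self, blocks):
        self.good = [None] * blocks   # disc that stayed in service
        self.new = [None] * blocks    # freshly formatted replacement
        self.high_water = 0           # blocks below this are equalized

    def write(self, block, data):
        self.good[block] = data
        if block < self.high_water:
            # Already-equalized region: write redundantly.
            self.new[block] = data
        # At or above the mark: simplex; the copy pass picks it up later.

    def equalize_step(self):
        """Copy one block upward; return True once the top is reached."""
        if self.high_water < len(self.good):
            self.new[self.high_water] = self.good[self.high_water]
            self.high_water += 1
        return self.high_water == len(self.good)
```

The point of the bucket is that occasional errors drain away, while a burst overflows it and takes the disc out of service. The point of the high-water mark is that the good disc never has to stop taking writes while the replacement catches up.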

I did the same sort of thing with the controllers, placing
one out of service if it was deemed to be failing, and all
reads and writes took place through the still functioning
controller, until the failing one got replaced.