
Re: Debian hotswap and 5 9's



on Wed, Dec 11, 2002 at 04:21:21PM +0100, Rogier Wolff (R.E.Wolff@BitWizard.nl) wrote:
> On Wed, Dec 11, 2002 at 02:19:23AM -0800, nate wrote:
> > Rogier Wolff said:
> > 
> > > No.
> > >
> > > Think RAID.
> > 
> > think CPU fan fails, CPU overheats, CPU fails, system crashes.
> 
> You misunderstand my "think Raid" remark. In a RAID configuration you
> can handle a WHOLE DISK going offline. If your SYSTEM can handle a
> whole CPU giving the ghost, then you can still achieve high uptimes 
> by just taking over the jobs on another machine. 

Repeating your assertion doesn't make it true.

Say, didn't the Netherlands just suffer a fire at Twente?  How many
"five 9s" servers did that take out?  Or consider an event some might
recall occurring in or around NYC on September 11, 2001.

At the level of five nines support, you're not talking single systems,
and you're very likely not talking single NOCs.  Better, your NOCs
should be several hours' distance apart, be served by independent
Internet backbones (if 'Net attached), or WAN links, and be wired into
relatively independent power grids.  Wind damage in Goose Lake, at the
CA/OR border, knocked 45% of California customers off the grid on
August 10, 1996[1].
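To put "five 9s" in concrete terms, here's a quick back-of-the-envelope
calculation of the downtime budget each availability target implies
(ignoring planned maintenance windows):

```python
# Downtime budget implied by an "N nines" availability target.
# Assumes a 365.25-day year; planned maintenance is not counted.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes(nines: int) -> float:
    """Minutes of unavailability per year allowed at the given number of nines."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes(n):8.2f} min/year")
```

Five nines works out to roughly 5.26 minutes of total downtime per year,
which is why single-box anecdotes don't get you there: one unattended
reboot blows the whole annual budget.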

Single-server uptimes of 1-2 years are not valid datapoints unless drawn
from a statistically valid sample.  Otherwise you're at best
demonstrating survivor identification capabilities.  A credible record
should point to a multi-year history, across multiple individual hosts,
comprising a "system".  Net uptime and/or availability of this system,
in the context of anticipated service, HW and SW upgrades, and
reasonably anticipated emergency occurrences (fire, flood, power outage,
earthquake, hurricane, severe wind, civil unrest, internal sabotage or
compromise) _might_ make a credible basis for claims.

Note that with the emerging significance of highly modular redundant x86
form factors (eg:  "blade" servers with 300+ nodes per standard 19"
rack), RAID may in fact play _no_ role, as service would consist of
wholesale replacement of individual nodes.  IBM's work on "self healing"
systems doesn't even call for replacement[2].  Instead, anomalous units
are simply shut down entirely, with the system as a whole having
sufficient redundant capacity to accommodate anticipated failures over
the planned life of the system.
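The "fail in place" sizing question above can be sketched numerically.
This is a back-of-the-envelope model, not IBM's method: the 300-node
rack, the per-node failure probability, and the confidence level are all
hypothetical numbers, and node failures are assumed independent.

```python
# How much spare capacity does a fail-in-place system need to ride out
# its planned service life without any hardware swaps?
# Model: each node fails independently with probability p_fail over the
# system's life; find the smallest spare count covering that risk at the
# desired confidence (cumulative binomial).

from math import comb

def spares_needed(nodes: int, p_fail: float, confidence: float) -> int:
    """Smallest s such that P(at most s failures) >= confidence."""
    cumulative = 0.0
    for k in range(nodes + 1):
        cumulative += comb(nodes, k) * p_fail**k * (1 - p_fail)**(nodes - k)
        if cumulative >= confidence:
            return k
    return nodes

# Hypothetical: 300-node rack, 5% chance each node dies over the planned
# life, 99.9% confidence of never exhausting spares.
print(spares_needed(300, 0.05, 0.999))
```

With those made-up numbers you'd expect about 15 dead nodes on average,
but provisioning only the mean leaves roughly even odds of running short;
the tail of the distribution is what actually sets the spare count.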

Peace.

----------------------------------------
Notes:

1.  http://www.energy.ca.gov/reports/70097003.html  This was one of
    _two_ major outages in summer 1996, and followed another extensive
    statewide outage, lasting upwards of a week, after the storm of
    Dec 13, 1995.


2.  I sat in on a seminar on this topic at the Stanford Computer System
    Lab Colloquium, neat stuff:
    http://www.stanford.edu/class/ee380/Abstracts/011128.html

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   If spam is the question, Spamassassin is the answer.
     http://spamassassin.taint.org/


