Re: .d.o machines which are down (Re: Questions for the DPL candidates)

On Thu, Mar 17, 2005 at 10:48:04PM +0100, David Schmitt wrote:
> On Thursday 17 March 2005 07:31, Joel Aelwyn wrote:
> > Don't even bother bringing up "redundant fiber". It may be, if it hasn't
> > been regroomed, and twenty plus years of network administrators have
> > learned the hard way that the gun is ALWAYS loaded. The best you can hope
> > for is a misfire.
> Debian is no enterprise, but debian is a group of responsible developers.
> To argue slightly ellipsoid:
> Putting up a requirement for 2 or 3 buildds hints at experiences of
> disasters by those involved that could have been easily fixed by a second
> machine. Thus the requirement.
> Debian as a whole is not very catastrophe-resistant statically, but
> is able to route around urgent breakage - as necessary - on a global
> scale. For example take the fire in U Twente, which took some Debian
> infrastructure down without causing widespread mayhem. There seem to be
> only disagreements on urgency on a smaller scale. Buildd availability for
> example.

All of the proposed standards involve real events that happen on a regular
basis, and apply only to buildds. If, for example, all of the buildds for
$ARCH had lived at UTwente, then $ARCH would have been completely unable to
autobuild for whatever amount of time was involved in bringing new hardware
online and setting it up as a buildd box, with all that that entails.

In fact, the fire at UTwente is an excellent example of *why* we should
be concerned with not only "how many machines", but "how many still
operational after $DISASTER". If we're going to bother caring, we should
*actually care*, not just say "Oh, we've got multiple machines, we'll be
fine" until we find out the hard way that having only melted silicon for an
architecture doesn't do much for our package build rates.

Note that the purpose of 'degraded operations' is not to keep up with the
full queue; rather, it's to keep up with *security* builds. The requirement
for "time to restore normal operations" was 1 week, in the proposed
wording; 6 hours only applied to being *completely without* any buildds at
all. "Found first in the wild" exploits still happen, and responding in
a timely fashion is crucial. Our security team does, as a rule, a rather
kick-ass job of this, but they have to have the tools available to *do*
the job.

Joel Aelwyn <fenton@debian.org>                                       ,''`.
                                                                     : :' :
                                                                     `. `'