
Re: .d.o machines which are down (Re: Questions for the DPL candidates)

On Wed, Mar 16, 2005 at 07:32:37PM -0500, Ben Collins wrote:
> On Wed, Mar 16, 2005 at 06:11:39PM -0800, Thomas Bushnell BSG wrote:
> > Ben Collins <bcollins@debian.org> writes:
> > 
> > > The requirement sucks, let's leave it at that. If the machine dies, I can
> > > have two to replace it within a day or two.
> > > 
> > > The point being, there's no reason to have two separate machines when one
> > > can do the job. As long as it keeps up, then there should be no cause for
> > > concern.
> > 
> > If you have one machine, and it dies, and it takes you a day or two to
> > replace it, then it cannot "do the job".  If you can guarantee that it
> > never dies (somehow), then maybe it could.
> Ok, I can guarantee that it never dies. The hard drives are in a RAID 5
> configuration, the power supplies are redundant, and if any of the
> three CPU/mem boards goes bad, I can just remove it and let the other two
> (4 CPUs and 4 GB of RAM) run. Then there are also two 10/100 Mbit Ethernet
> adapters.
> It won't die altogether; it's an enterprise-class system. It's meant to
> keep going, even if it has to limp to do so. Even with one CPU/mem board, it
> would still have 2 CPUs and 2 GB of RAM.

And when the network to that building dies due to Backhoe-Induced Fiber
Fade, it will cheerfully sit there and chug along at a very fast idle with
nothing to do.

Don't even bother bringing up "redundant fiber". It may be, if it hasn't
been regroomed, and twenty-plus years of network administrators have
learned the hard way that the gun is ALWAYS loaded. The best you can hope
for is a misfire.

This applies equally to having twenty buildds sitting in a rack, of course.
If the purpose of the B={1,2}+1 rule is, in part, protection against
buildds vanishing off the face of the earth, we need to decide just what
level of redundancy we really want - resilience to fiber cuts? City power
loss or statewide rolling blackouts? Tsunami? ISP business failures?

In case you missed the point, I've known networks knocked out by each of
the above, in some cases permanently (or for long enough to be effectively
equivalent). Hell, I shut off the primary ISP link for Guam for non-payment
at a former employer, multiple months in a row. The iron doesn't do us any
good if it isn't reachable. If we're going to propose new standards, let's
not forget to include important details.

For example:

* The architecture must have sufficient primary and backup buildds, of
  sufficient capability, to meet the following requirements:

  - No failure of power grid (up to regional size), physical facility, or
    network access (alone or in combination) can take down enough machines
    that no buildds are available for any period longer than six hours, at
    any time, or require local admin intervention to come online in case of
    a failure. It must be possible to return the port to normal operation
    within one week from initial failure.

    [ If we're concerned about acts of god or man that can hit multiple   ]
    [ regional power grids, we have a very different set of concerns.     ]
    [ This also allows for primary/hot-backup machines, as long as the    ]
    [ backup machines can be brought online by DSAs with no local admin   ]
    [ required, and within six hours. If the DSA folks need to take       ]
    [ longer than a week to return the buildds to operation, that's their ]
    [ domain, but it must be *possible* to do within a week by reasonable ]
    [ standards - not 24x7 multiple-person herculean efforts. Restoring a ]
    [ buildd is likely to take local admin intervention, of course.       ]

  - During normal operation, the buildds must be able to keep up with
    day-to-day package compilation loads. This means that the average
    package-built-and-uploaded rate must be at least 110% of the incoming
    upload rate, and that the architecture completely empties the build
    queue at least once per week.

    [ If it wasn't obvious, 110% is deliberately ludicrous; it's a minor  ]
    [ detail that frankly should probably be determined by observing      ]
    [ our current 'successful' buildds, the ones nobody complains about   ]
    [ in general, and seeing how well they really do. The second clause   ]
    [ could be made irrelevant by changing the build ordering algorithm   ]
    [ to one that doesn't have starvation failure cases, but unless it    ]
    [ is, it matters that the buildds for an architecture *regularly*     ]
    [ empty the build queue under normal operation. Putting it at a week  ]
    [ is my own personal guess at a balance between 'not penalized for    ]
    [ upload surges' (from things like BSPs) and 'maintainers can assume  ]
    [ that low priority uploads can proceed to testing without undue      ]
    [ delay'.                                                             ]

  - At least one buildd supporting security-related build queues must be
    available at all times, even during degraded operations. All buildds
    which handle security-related build queues must be able to compile any
    package intended for a stable release in no more than 300% of the time
    required by the slowest build for any architecture which qualifies for
    distribution on the (primary mirror network / tier-1 mirrors / etc).

    [ These requirements are to ensure reasonable and prompt security     ]
    [ support, and as such, are simply my best guess from what various    ]
    [ people have said matters for this; if anyone on the security        ]
    [ team(s) wants to provide further information on what matters in     ]
    [ their work, that obviously takes priority over guessing...          ]

    [ As has been stated elsewhere, if the slower architectures can       ]
    [ reliably compile a single package fast enough, whether it's by      ]
    [ distcc, binary gnomes, or magic smoke, that suffices for this       ]
    [ requirement. The idea behind 300% of the slowest tier-1 arch is     ]
    [ mostly that it be able to keep up with the 'popular' archs for      ]
    [ security compiles, and thus won't unduly delay a security release.  ]
    [ It would also be possible to set a specific time limit on *any*     ]
    [ package for *any* architecture to build, but that seems fraught     ]
    [ with landmines.                                                     ]

    [ The recommended setup, of course, would be for *all* buildds to be  ]
    [ sufficiently capable to handle the security queues, and for them    ]
    [ all to then do so, whenever possible, but this requirement should   ]
    [ guarantee that it's always possible to do it at least semi-sanely,  ]
    [ though it might require manual intervention if a slower arch is     ]
    [ building, say, GCC or X on the only active high-rate box.           ]
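To make the numeric criteria above concrete, here's a rough sketch (in
Python) of the three checks as I'd state them. Every name and every figure
below is a made-up illustration for the sake of the example, not measured
buildd data or a real tool:

```python
# Hypothetical sketch of the three numeric requirements proposed above.
# All function names, rates, and sample figures are illustrative only.

def keeps_up(build_rate, upload_rate, margin=1.10):
    """True if the average build-and-upload rate is at least 110%
    (the deliberately ludicrous margin) of the incoming upload rate."""
    return build_rate >= margin * upload_rate

def drains_weekly(queue_len, build_rate, upload_rate, days=7):
    """True if the build queue can be fully emptied at least once per
    week, given a net drain of (build_rate - upload_rate) packages/day."""
    net = build_rate - upload_rate
    return net > 0 and queue_len / net <= days

def security_ok(build_time, slowest_tier1_time, factor=3.0):
    """True if a security buildd can compile a package within 300% of
    the time taken by the slowest tier-1 architecture's build."""
    return build_time <= factor * slowest_tier1_time

# Made-up example: 120 builds/day against 100 uploads/day, a backlog of
# 130 packages, and a 9-hour security build vs. a 4-hour tier-1 worst case.
print(keeps_up(120, 100))            # True: 120 >= 110
print(drains_weekly(130, 120, 100))  # True: 130/20 = 6.5 days <= 7
print(security_ok(9, 4))             # True: 9 <= 12 hours
```

The point of writing it this way is that each requirement reduces to a
single inequality, so whatever thresholds we eventually agree on can be
checked mechanically against observed buildd statistics.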

Joel Aelwyn <fenton@debian.org>                                       ,''`.
                                                                     : :' :
                                                                     `. `'
