
Re: .d.o machines which are down (Re: Questions for the DPL candidates)



On Sun, Mar 20, 2005 at 01:26:38PM -0500, Ben Collins wrote:
> I think they are designed too stringently. Guidelines should describe the
> level of stability an arch is required to meet, and let the implementation
> be whatever is needed, on a per arch basis, to meet those requirements.

> The guidelines should not say something like "needs two buildds minimum",
> but instead, "needs to remain within 2 days of package building times",
> and set repercussions based on stability of build times and turnaround,
> rather than number of buildds.

> Case in point, mips and arm cannot even cope with only 2 buildds. It isn't
> enough. However archs like x86-64, sparc, etc. can keep up with just one,
> or two. So the number of buildds isn't what is important. The problem with
> architectures not keeping up isn't a matter of buildd stability so much as
> speed. I don't recall any architectures falling behind miserably just
> because a buildd went down for an extended period, but I do recall some
> (m68k) having problems simply because of lack of processing power.

Last summer the primary alpha buildd, lully.d.o, went off-line due to
hardware failures, leaving us with an under-powered backup, escher, that
was unable to keep up with the package load.  This persisted for more than
a month, IIRC, until goedel.d.o could be brought on-line to replace lully.
Goedel has been down for a moderate period of time at least once since
then.

For sparc, a second buildd was brought on-line on auric this year because
(IIRC) vore was not keeping up with the upload volume at the time.  This
required effort on DSA's part to clear enough disk space to run a buildd,
and until that was done sparc was holding some RC bugfixes out of testing.
If sparc had had a buildd in reserve, this would not have affected the flow
of development for sarge.  Auric is now off-line, as noted.

ARM, mips, and mipsel have each repeatedly had problems keeping up with
unstable due to hardware failures.  (Sometimes compound hardware failures,
which are obviously going to be statistically more common when you need more
pieces of hardware to keep up in the first place...)
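
As a rough illustration of that statistical point, here is a minimal sketch
assuming each buildd is independently down with the same probability p at
any given moment.  Both the independence assumption and the value of p used
below are hypothetical, chosen only to show the trend, and are not measured
figures for any port.

```python
from math import comb


def prob_at_least_k_down(n: int, p: float, k: int = 2) -> float:
    """Probability that at least k of n machines are down at the same time.

    Assumes (hypothetically) that each machine fails independently with
    probability p, i.e. a plain binomial model.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p = 0.05  # hypothetical per-machine downtime probability
    for n in (2, 4, 8):
        # Chance of a compound failure (two or more machines down at once)
        print(n, prob_at_least_k_down(n, p))
```

Under this model the chance of a compound failure rises with the number of
machines an architecture needs just to keep up.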

These incidents have come to my attention precisely because they've impacted
my work as release manager.  I do not want to spend my time during the etch
release cycle cajoling porters into getting buildd hardware back on-line --
I have spent too much time already waiting on buildds for sarge to be
willing to do that again.  Buildds for release architectures need to be
keeping up on an ongoing basis, not just when everything's working right,
because the odds are very much against everything working right, on all
release architectures, for the duration of a release cycle unless there's
redundancy in place.  I'm not willing to accept an architecture as a release
candidate for etch unless there is redundancy in place -- if a port is not
willing to provide enough buildd hardware to prevent hardware failures from
causing my work to pile up, then I'm not willing to release manage that
port.

As I already said, you can squeeze by without geographic separation if you
choose.  I just don't think it's a good idea for the Sparc porters to settle
for this arrangement, given that any prolonged outage at Visi.net will mean
immediately dropping Sparc from consideration as a release arch.

-- 
Steve Langasek
postmodern programmer
