Re: .d.o machines which are down (Re: Questions for the DPL candidates)
On Sun, Mar 20, 2005 at 06:17:10PM -0800, Steve Langasek wrote:
> The primary alpha buildd last summer, lully.d.o, went off-line due to
> hardware failures and we were left with an under-powered backup, escher,
> that was unable to keep up with the package load. This persisted for more
> than a month, IIRC, until goedel.d.o could be brought on-line to replace
> lully. Goedel has been down for a moderate period of time at least once
> since then.
> For sparc, a second buildd was brought on-line on auric this year because
> (IIRC) vore was not keeping up with the upload volume at the time; this
> required effort on DSA's part to clear enough disk space to be able to run a
> buildd, until which time sparc was holding some RC bugfixes out of testing.
> If sparc had had a buildd in reserve, this would not have affected the flow
> of development for sarge. Auric is now off-line, as noted.
> ARM, mips, and mipsel have each repeatedly had problems keeping up with
> unstable due to hardware failures. (Sometimes compound hardware failures,
> which are obviously going to be statistically more common when you need more
> pieces of hardware to keep up in the first place...)
Each time, we were told these incidents would not impact the release
because the buildds had plenty of time to catch up, that the situation
was under control, and that the help offered in terms of new hardware and
manpower was not needed at that stage.
Suddenly, this is so much of a problem that we are considering dropping 8
architectures? In that case, isn't it time to first reconsider all the
> These incidents have come to my attention precisely because they've impacted
> my work as release manager. I do not want to spend my time during the etch
> release cycle cajoling porters into getting buildd hardware back on-line --
> I have spent too much time already waiting on buildds for sarge to be
> willing to do that again. Buildds for release architectures need to be
> keeping up on an ongoing basis, not just when everything's working right;
> because the odds are very much against everything working right, on all
> release architectures, for the duration of a release cycle unless there's
> redundancy in place. I'm not willing to accept an architecture as a release
> candidate for etch unless there is redundancy in place -- if a port is not
> willing to provide enough buildd hardware to prevent hardware failures from
> causing my work to pile up, then I'm not willing to release manage that
Of course they need to, but so far we were told everything was working
fine, which does not provide much incentive to improve the buildd
network, especially when the w-b admins are designing the new w-b
infrastructure and have probably had less time to deal with the offers.
Imagine a large red swirl here.