
Re: unexpected NMUs || buildd queue



Wouter Verhelst <wouter@grep.be> writes:

> On Sat, Jul 17, 2004 at 08:08:21PM +0200, Thiemo Seufer wrote:
>> Wouter Verhelst wrote:
>> [snip]
>> > > > What's your alternative?
>> > > 
>> > > Obviously to clean the chroot automatically unless a clean-buildd-shutdown
>> > > flag was written. Shouldn't be hard to implement.
>> > 
>> > I'd like to see that implemented properly. There are a number of issues
>> > I can think of offhand that would make it hard:
>> > * the system crash might be the result of the build itself. Runaway
>> >   memory-eating loop, causing the swapper to thrash so horribly that a
>> >   power cycle is the fastest way to get it up and running again, for
>> >   example. Yes, those happen. In such a case, you don't want to wipe out
>> >   the chroot; you want to check out what went wrong, and you might need
>> >   whatever's in the chroot to find out.
>> 
>> Then, on startup, the buildd should check for an existing chroot, and
>> stop or move it aside if it wasn't clean.
>
> So, you're advocating manual cleanup here.
>
>> > * unstable is a moving (and breaking) target. Doing a debootstrap --
>> >   especially when tried noninteractively -- doesn't always work. And
>> >   yes, we need the chroot to be unstable. Think about it.
>> 
>> Try an unstable debootstrap, if this fails, try testing and upgrade,
>> if this fails as well, try stable and upgrade.
>
> By that time, you've wasted so much time that your backlog is probably
> so huge you wished buildd didn't try to be so smart, kicked it out of
> the process table, and cleaned up stuff manually instead.
>
>> Alternative: Keep a clean unstable chroot tarball around and update
>> it regularly.
>
> Waste of resources. buildd doesn't crash every day (luckily), and
> updating a chroot tarball requires quite a bit of resources (in CPU time
> and disk buffers): "untar+gunzip tarball, chroot (which loads a number
> of binaries to memory, thereby pushing other stuff out of the disk
> buffers that are used to do useful things with the system), apt-get
> update (which requires gunzip and some relatively cpu-intensive parsing
> as well), apt-get upgrade (which could fail or loop), exit the chroot
> and tar+gzip"

Laughable for any recently built system. Even creating the chroot from
scratch with cdebootstrap is a matter of one to a few minutes nowadays.
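
To make that concrete, here is an untested sketch of the fallback Thiemo
suggested (unstable first, then testing or stable plus a dist-upgrade).
The paths, the mirror and the whole script are just my assumptions, not
code from any existing buildd:

#!/usr/bin/env python3
# Untested sketch: bootstrap an unstable chroot, falling back to testing
# or stable plus dist-upgrade if the newer suite fails to bootstrap.
# Paths and the mirror are placeholders.
import shutil
import subprocess

CHROOT = "/srv/chroot/build"                 # assumed location
MIRROR = "http://ftp.debian.org/debian"      # assumed mirror

def bootstrap_chroot():
    for suite in ("unstable", "testing", "stable"):
        shutil.rmtree(CHROOT, ignore_errors=True)   # drop partial attempts
        if subprocess.run(["cdebootstrap", suite, CHROOT, MIRROR]).returncode:
            continue                                # try the next suite
        if suite != "unstable":
            # bootstrapped an older suite, so pull it up to unstable
            subprocess.run(["chroot", CHROOT, "sed", "-i",
                            "s/" + suite + "/unstable/",
                            "/etc/apt/sources.list"], check=True)
            subprocess.run(["chroot", CHROOT, "apt-get", "update"],
                           check=True)
            subprocess.run(["chroot", CHROOT, "apt-get", "-y",
                            "dist-upgrade"], check=True)
        return True
    return False                                    # give up, tell the admin

if __name__ == "__main__":
    if not bootstrap_chroot():
        raise SystemExit("all bootstrap attempts failed")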

And for m68k, considering disk buffers a problem is a joke. The 128 MB
of RAM will have been flushed and reflushed just by installing and purging
the 200 MB of Build-Depends of the last gnome or kde build.

You also can't count the time apt-get itself takes, since with the
current setup you make exactly the same calls to update the system.

So the difference is untar/gunzip and tar/gzip. Yes, those can take some
time on m68k. But that time is easily won back by not failing a kde or
gnome package build that installs 200 MB of Build-Depends just to notice
that the installed version isn't good enough.
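
The regular tarball refresh itself is just the steps quoted above run
from cron; a rough sketch, with the paths invented by me:

#!/usr/bin/env python3
# Rough sketch of "keep a clean unstable chroot tarball and update it
# regularly", meant to run from cron at a quiet hour.  The paths are
# invented; this is not taken from the real buildd scripts.
import shutil
import subprocess

WORKDIR = "/srv/chroot/clean-unstable"          # scratch unpack area
TARBALL = "/srv/chroot/clean-unstable.tar.gz"   # the known-clean tarball

def refresh_tarball():
    shutil.rmtree(WORKDIR, ignore_errors=True)
    subprocess.run(["mkdir", "-p", WORKDIR], check=True)
    # untar + gunzip the last clean state
    subprocess.run(["tar", "-xzpf", TARBALL, "-C", WORKDIR], check=True)
    # apt-get update / dist-upgrade inside the chroot
    subprocess.run(["chroot", WORKDIR, "apt-get", "update"], check=True)
    subprocess.run(["chroot", WORKDIR, "apt-get", "-y", "dist-upgrade"],
                   check=True)
    subprocess.run(["chroot", WORKDIR, "apt-get", "clean"], check=True)
    # tar + gzip the result, replacing the old tarball only on success
    subprocess.run(["tar", "-czpf", TARBALL + ".new", "-C", WORKDIR, "."],
                   check=True)
    shutil.move(TARBALL + ".new", TARBALL)

if __name__ == "__main__":
    refresh_tarball()

If the upgrade or the repack fails, the old known-clean tarball simply
stays in place, which is exactly what you want.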

> Oh, and you can't just tar up the "live" chroot, it could be in a state
> where it requires maintenance already (and no, that can't be
> automatically detected, at least not always)
>
>> > There are probably more things I could come up with, but I didn't try
>> > hard. Wiping out and recreating the buildd chroot isn't an option.
>> > Neither is creating a new one alongside the original, unless the disk
>> > space requirements are a non-issue (which isn't true for some of our
>> > archs).
>> 
>> Worst case would be to stop the buildd in such a condition. 
>
> You're advocating manual cleanup again here :-P

Yes, better than keeping on building with a broken system, as is done now.

>> Many buildd machines should be able to do better.
>> 
>> > The only other option I could think of is to implement an AI that would
>> > investigate the chroot and remove any anomalies before restarting the
>> > next build, obviously all the while creating a perfectly detailed log
>> > (as in, exactly the amount of details you'll need to learn about what
>> > went wrong; nothing more and nothing less). That'd be nice to have, I'd
>> > say... ;-)
>> > 
>> > Really, such cleanups can't be properly automated IMO. I agree that
>> > there are cases where buildd could be improved, but that doesn't mean
>> > manual cleanups can be avoided; and after a system crash, if a cleanup
>> > is required, it must be done manually.
>> 
>> There will surely remain some cases where automatic cleanup isn't
>> possible, but handling common failure modes automatically should work.
>
> Luckily, "buildd crashed" isn't one of them common failure modes. My
> point (which you apparently missed) was that yes, there is one case
> where buildd loses track of some packages, and yes, that's probably a
> bug, but it's a corner case which requires manual cleanup anyway (for
> other reasons), so having to check whether there are lost packages isn't
> a real issue.

Here another feature of multibuild comes to mind.

Multibuild keeps track of the build times of packages and the relative
speeds of buildds. Multibuild can then guess the expected build time
for a package. If that time is exceeded by a sizeable margin, the
buildd admin can be notified, and on inaction the package will be
returned so another buildd can have a shot at it.

The same goes for packages getting stuck in other temporary states,
like sitting in the uploaded state for a week.
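
Just as an illustration, the checks could look roughly like this; the
margins, states and bookkeeping are made up by me for the example and
are not how multibuild is actually implemented:

#!/usr/bin/env python3
# Invented illustration of the timeout checks described above: estimate
# the expected build time from recorded build times and the relative
# speed of each buildd, and flag packages that sit in a state too long.
import time

SLOWDOWN_MARGIN = 3.0          # assumed: complain at 3x the expected time
UPLOADED_TIMEOUT = 7 * 86400   # assumed: a week in state "uploaded"

# hypothetical bookkeeping: seconds a reference machine needed per package,
# and how fast each buildd is relative to that reference machine
reference_build_time = {"glibc": 40000, "hello": 60}
relative_speed = {"crest": 0.15, "kullervo": 0.2}

def expected_build_time(package, buildd):
    return reference_build_time[package] / relative_speed[buildd]

def check_package(package, buildd, state, state_entered):
    """Return a warning if the package looks stuck, else None."""
    elapsed = time.time() - state_entered
    if state == "building":
        if elapsed > SLOWDOWN_MARGIN * expected_build_time(package, buildd):
            return ("%s on %s takes far longer than expected: notify the "
                    "admin, and on inaction give the package back"
                    % (package, buildd))
    elif state == "uploaded":
        if elapsed > UPLOADED_TIMEOUT:
            return "%s has been in state uploaded for over a week" % package
    return None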

Packages that have finished building will remain under the buildd
admin's control only for a limited time before getting assigned to a
pool for that arch, or maybe even a general pool of all buildd admins.
Packages that aren't handled by their buildd admin for some reason (like
sickness) then get processed by any admin who has some spare time to
work on the pool.
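
Again only as a sketch, with the time limits made up by me:

#!/usr/bin/env python3
# Made-up sketch of the hand-off: a finished build stays with the buildd's
# own admin for a while, then becomes claimable by the per-arch pool, and
# finally by a general pool of all buildd admins.
import time

ADMIN_WINDOW = 2 * 86400       # assumed: two days for the buildd's own admin
ARCH_POOL_WINDOW = 5 * 86400   # assumed: five more days for the arch pool

def responsible_for(finished_at, now=None):
    """Who may handle a finished build at this point in time."""
    now = time.time() if now is None else now
    age = now - finished_at
    if age < ADMIN_WINDOW:
        return "buildd admin"
    if age < ADMIN_WINDOW + ARCH_POOL_WINDOW:
        return "arch pool"       # any admin for this architecture
    return "general pool"        # any buildd admin with spare time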

The design tries to avoid single points of failure. The sudden absence
of a buildd or an admin should not disrupt the process.

Regards,
        Goswin


