
Re: buildd reliability



On 2023-03-30 19:59, Wookey wrote:
> On 2023-03-26 12:25 +0200, Aurelien Jarno wrote:
> 
> > The 3 arm64 boards running at ARM are pretty fine, we do not have any
> > issues with them, however they start to be old.
> > 
> > On the other hand we have many issues with the Ampere servers hosted at
> > UBC and the Applied Micro servers hosted at Conova. All of them crash
> > regularly (a few times per week in total) and need a powercycle. In
> > addition the bullseye kernel does not work on Applied Micro servers, so
> > we are currently stuck with buster on them :(.
> 
> OK. That's not good. Can you say which hardware those machines are?
> Our buildd database does not say what actual kit is in use (just the
> manufacturer), and I don't have rights to read the detailed buildd
> admin info on the UBC and conova sites.
> 
> [ Aside: what would it take to put an extra field into our machine
> database to specify what hardware each machine was? It can sometimes
> be tricky to separate Model/motherboard/CPU as the required bit of
> info but it would be really useful to write something more detailed
> down both for issues like this and debugging. ]

This kind of info is usually specified in the Processor field, but only for
physical hosts.

> My guess is that all the Conova machines are Mustangs, and the UBC machines
> are eMAGs? Is that right?

Unfortunately we do not have much information about the Conova machines,
namely conova-node01 and conova-node02. Their boot log mentions "Machine
model: Gigabyte X-Gene MP30-AR0 board".

The UBC machines are ubc-node-arm04, ubc-node-arm05 and ubc-node-arm06.
They are Lenovo HR330A systems, and according to db.debian.org the CPUs
are Ampere eMAG 8180 64-bit Arm @ 3.3GHz.

Feel free to send us instructions on how to gather more details about
the hardware.
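For what it's worth, on a Debian host something like the following
usually captures the model and CPU details (a sketch, not an official
procedure; dmidecode needs root and the dmidecode package, and on
device-tree based arm64 machines the model string also appears under
/proc/device-tree):

```shell
# CPU details: architecture, vendor, model name, core count
lscpu

# System and board model from SMBIOS/DMI; needs root
dmidecode -t system -t baseboard

# On device-tree based arm64 machines the model string lives here
cat /proc/device-tree/model 2>/dev/null; echo
```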

> Some enquiries tell me that both these machines types are reliable
> (although the mustangs are slow) at OBS and Yocto, so they can be OK,
> but there is certainly much faster kit available now (Ampere Altra).
> 
> Is there a bug about the boot failure on the Applied Micro machines? I
> just failed to find one. If we know what hardware it is we can
> investigate, because that does seem like something that should be
> fixed.

No, we never tracked the issue properly; we just have a note in the DSA
RT ticket about the replacement of these hosts, as they are quite old
(~7 years).

I will try to boot the bullseye kernel again (the last time was 1.5
years ago) and get back to you.

> > > I'm sure we can get new arm64 buildds if we need them.
> > 
> > Yes please. It's becoming urgent to get new ARM64 hardware to overcome
> > all those issues, and we (DSA) failed to find new hardware to buy at a
> > decent price.
> 
> OK. I'll see what can be done. I see Altra servers are from
> $7000-$53000 on https://store.avantek.co.uk/arm-servers.html.

Thanks for the pointer, it's not something we have studied yet.

> What does DSA consider 'decent'? I guess we'd prefer the resilience of
> a couple of reasonable machines over one ridiculously manly one. A bit
> of configury on the Avantek site suggests that basic ARM Altra servers
> cost about twice as much as AMD ones for similar specs
> (cores/RAM/disk), but then the power consumption is less than half. I
> don't know how the performance actually compares for buildd purposes
> (nor what sort of spec we prefer in terms of
> nodes/cores/RAM/Disk/networkIF), but people describe the Altras as
> 'fast'. I'll try and collect some more details to quantify that.

What we consider decent is something in the same price range as similar
x86 hardware. It seems that you also found that arm64 hardware is much
more expensive.

Given we need to support 3 architectures (armhf, armel, arm64) with the
same hardware, we run the buildds and porterboxes as VMs using ganeti,
on two nodes per site, which allows live migration for maintenance
(including software upgrades) and survives the failure of one node. We
currently run 6 buildds at UBC, and 3 buildds + 1 porterbox at Conova.
If we eventually want to be able to retire the buildds at ARM (though so
far they work quite well), we need to be able to run 6 VMs at each site.
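For context, the two-node pattern above maps onto plain ganeti commands;
the node and instance names below are made up, and the exact disk size,
memory and OS definition depend on the cluster setup:

```shell
# Create a DRBD-mirrored VM across both nodes so it can live-migrate
# (hypothetical names and sizes; -o selects the ganeti OS definition)
gnt-instance add -t drbd --disk 0:size=300G -B memory=16G,vcpus=4 \
    -o debootstrap+default -n node01:node02 buildd-arm64-01

# Live-migrate it to the secondary node before maintenance on the primary
gnt-instance migrate buildd-arm64-01
```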

So in terms of specification, that probably means something around 24
cores per node, 96GB of RAM, and 2TB of disk. Maybe a tad more, so that
the machines can be used for the next 5 to 7 years.
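As a rough back-of-the-envelope check (my own arithmetic, with
hypothetical per-VM figures, not DSA policy), 6 VMs per node at 4 vCPUs,
16GB RAM and ~340GB disk each lands close to those numbers:

```python
# Per-node sizing for 6 buildd/porterbox VMs (assumed per-VM figures)
vms = 6
vcpus_per_vm = 4
ram_gb_per_vm = 16
disk_gb_per_vm = 340

print(f"cores per node: {vms * vcpus_per_vm}")   # cores per node: 24
print(f"RAM per node:   {vms * ram_gb_per_vm}GB")  # RAM per node:   96GB
print(f"disk per node:  {vms * disk_gb_per_vm}GB")  # ~2TB
```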

> Does Debian run to a policy on packages/Wh for buildds yet, I wonder
> (efficient hardware lowers emissions, for a given workload)? It's
> worth paying something for more power-efficient kit, possibly quite a
> lot for hardware like this that will run hard for years.

As we struggle to buy hardware at all, we are not picky about that.
However, from my understanding, arm64 hardware is supposed to be more
power-efficient than x86 anyway.

> Are we running debian CI on this hardware or is that all done in the
> cloud?

debian CI is not run by DSA, so it is not on this hardware, but I don't
know exactly where it runs.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net
