
help needed to manage s390x host for ci.debian.net



Dear all,

This is a call for help with s390x in particular, and with debugging our s390x host.

ci.debian.net is the infrastructure that enables the Debian Release Team to run autopkgtest as part of the quality assurance for unstable-to-testing migration. Historically, that was exclusively amd64, but over the last two years it has been extended to arm64, armhf, armel, i386, ppc64el and s390x. We have one s390x VM (generously provided by IBM) hosted at Marist College, running 10 debci workers in parallel.

Now there are a couple of things that make me reach out, because I am coming to the conclusion that I can't handle it myself. I have the impression that the s390x host isn't delivering what it's capable of. Instead of asking for more resources (which I believe would be granted), I believe we should first check whether there are things we can fix on the Debian side. Here are a couple of observations and ideas.

* [observation] contrary to the other architectures, the queue (rabbitmq) for s390x doesn't empty anymore. Even if no packages are being processed and the database doesn't know about pending jobs, there are typically tests left in the queue (and their number increases over time).
* [observation] with jobs in the queue, the number of packages being processed [1] often isn't equal to the number of debci workers, meaning some workers are idle/waiting. Compare that e.g. to our amd64 host #13 [2] where, if there's a queue, the number of packages being processed is flat at the number of debci workers.
* [observation] we checked the average time a test runs on s390x and it doesn't deviate much from the other architectures, and it's for sure not the worst. However, the number of tests processed per day per debci worker is the lowest of all architectures [3], easily half that of i386, which has 11 workers vs 10 on s390x.
* [observation] in general, I believe that we could set up our hosts to use tmpfs for the testbed, because I really believe that installing the packages for the test takes a considerable amount of time, and for short tests that means everything. Most perl tests only take seconds, while preparing the testbed takes multiple tens of seconds. If we traded disk for memory, I think we could get much more out of any host with the same amount of CPU.
* [suspect 1] network issues between the s390x host and the main ci.d.n server (the results (log files) of the autopkgtests are transferred to the main server). Our ppc64el hosts are also located at Marist, so I would expect commonality here, but ppc64el isn't performing great either, so maybe part of the problem is common.
* [idea] maybe the queuing and back reporting in debci could be improved to transfer the logs separately from the result, such that the transfer doesn't block the workers from picking up new tasks.
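
To make the last idea concrete, here is a minimal sketch in Python of a worker loop that reports the small pass/fail result immediately and hands the large log to a background uploader, so the next job can start while the log is still in transit. This is not how debci is implemented; `run_worker` and the callables `run_test`, `report_result` and `upload_log` are hypothetical names used only for illustration.

```python
import queue
import threading


def run_worker(jobs, run_test, report_result, upload_log):
    """Run jobs one by one; report results right away, upload logs in the background.

    All four arguments are hypothetical stand-ins, not debci APIs:
      run_test(job)              -> (status, log)
      report_result(job, status) -> sends the small result message
      upload_log(job, log)       -> performs the large, slow log transfer
    """
    pending_logs = queue.Queue()

    def uploader():
        # Drain the queue until the sentinel (None) arrives.
        while True:
            item = pending_logs.get()
            if item is None:
                break
            job, log = item
            upload_log(job, log)

    thread = threading.Thread(target=uploader)
    thread.start()

    for job in jobs:
        status, log = run_test(job)
        report_result(job, status)    # small message, sent immediately
        pending_logs.put((job, log))  # large transfer, off the critical path

    pending_logs.put(None)  # tell the uploader that no more logs will follow
    thread.join()           # wait for outstanding uploads before exiting
```

With a single uploader thread the logs still arrive in job order; the only change is that a slow transfer no longer delays the start of the next test.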

What drove me here is that several days ago I updated the VM (it runs bullseye) and as part of that a new kernel was installed. After the reboot, it looks like mariadb doesn't want to install anymore in the testbeds (the tests hang and time out). See e.g. https://ci.debian.net/packages/d/dbconfig-common/testing/s390x/ There is also an FTBFS in mariadb related to s390x: #1030510 [4], where it was hinted to be due to a bad kernel.

On top of that, I have observed very often that after a reboot of the host the number of packages processed in the first few days is considerably lower (1/10 or so) than normal. I'm not seeing anything deviating on the system, except that. I have run out of ideas about where to look.

Personally I don't really care about s390x and it's costing me more time than the other architectures. I don't want to do this anymore.

Paul
PS: I may have forgotten some observations and ideas, but the message is long enough already and I wanted to send it.

[1] https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/debci_packages_being_processed.html
[2] https://ci.debian.net/munin/ci-worker13/ci-worker13/debci_packages_being_processed.html
[3] https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_total_packages_processed.html
[4] https://bugs.debian.org/1030510

$ rake capacity
amd64           17
arm64           20
armel           12
armhf           12
i386            11
ppc64el          8
riscv64         18
s390x           10
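
To compare architectures fairly, the per-day totals need to be divided by the worker counts above. A minimal sketch in Python, assuming the capacity output is available as text and that daily totals per architecture are obtained separately (e.g. read off the munin graph); `parse_capacity` and `per_worker` are hypothetical helper names, and no real totals are included here.

```python
CAPACITY_OUTPUT = """\
$ rake capacity
amd64           17
arm64           20
armel           12
armhf           12
i386            11
ppc64el          8
riscv64         18
s390x           10
"""


def parse_capacity(text):
    """Turn 'arch  N' lines into {arch: N}; any other line is ignored."""
    capacity = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            capacity[parts[0]] = int(parts[1])
    return capacity


def per_worker(totals, capacity):
    """Divide each architecture's daily total by its number of workers."""
    return {
        arch: total / capacity[arch]
        for arch, total in totals.items()
        if capacity.get(arch)  # skip unknown architectures and zero capacity
    }
```

Calling `per_worker(daily_totals, parse_capacity(CAPACITY_OUTPUT))` with a dictionary of measured daily totals gives tests per day per worker for each architecture, which is the figure the third observation refers to.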


