[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: help needed to manage s390x host for ci.debian.net



On Tue, Feb 14, 2023 at 09:42:09PM +0100, Paul Gevers wrote:
> Hi Phil,
> 
> On 13-02-2023 08:57, Philipp Kern wrote:
> > On 12.02.23 22:38, Paul Gevers wrote:
> > > I have munin [1], but as said, I'm not a trained sysadmin. I don't
> > > know what I'm looking for if you ask "statistics on the network".
> > 
> > This is more of a software development / devops question than a sysadmin
> > question, but alas.
> 
> I acknowledge that my reach out was broad and didn't only cover s390x.
> 
> > What I am interested in is *application-level* logging on reconnects.
> > Presumably the connection to RabbitMQ is outbound?
> 
> Our configuration can be seen here:
> https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb
> 
> > Is it tunneled? Does your application log somewhere when a reconnect
> > happens? Does it say when it successfully connected?
> > 
> > I'd expect good software to log something like this:
> > 
> > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> > [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
> > 
> > And also:
> > 
> > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
> > [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
> > failed: Connection refused
> 
> @terceiro; I haven't seen these kind of logs on the worker hosts. Do you
> know if they exist or if we can generate them?

The worker does log it's initial connection, see below.

> I think I'm seeing something on the main host.
> admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163
> rabbit@ci-master.log | grep -v '\[info\]' |  grep -v '\[warning\]'
> 2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection
> <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
> 2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection
> <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
> 2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection
> <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
> 2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection
> <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
> 2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection
> <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
> 2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection
> <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):
> 
> and a lot more warnings (220 times in 20 hours) as well; like:
> 2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection
> <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user:
> 'guest'):
> 
> And a lot (around 544) (obviously I don't know if that's only or even
> includes the s390x host):
> client unexpectedly closed TCP connection

root@ci-worker-s390x-01:~# journalctl -u debci-worker@1.service --since='2 days ago' -t debci -b 0 | grep amqp
Feb 15 15:10:21 ci-worker-s390x-01 debci[663]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 17:58:53 ci-worker-s390x-01 debci[2740543]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:23:40 ci-worker-s390x-01 debci[1855652]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:28:12 ci-worker-s390x-01 debci[1939916]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 15 20:50:51 ci-worker-s390x-01 debci[783145]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 21:36:25 ci-worker-s390x-01 debci[1966510]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 15 22:21:43 ci-worker-s390x-01 debci[3243793]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 16 00:29:41 ci-worker-s390x-01 debci[4119188]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 01:21:26 ci-worker-s390x-01 debci[2097411]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 03:02:20 ci-worker-s390x-01 debci[1133799]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 06:40:46 ci-worker-s390x-01 debci[953820]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 08:00:24 ci-worker-s390x-01 debci[2875496]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 09:59:09 ci-worker-s390x-01 debci[3864527]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 11:47:09 ci-worker-s390x-01 debci[2310984]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 14:08:01 ci-worker-s390x-01 debci[1968077]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 16:58:24 ci-worker-s390x-01 debci[2496027]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 18:40:23 ci-worker-s390x-01 debci[1074224]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 20:35:15 ci-worker-s390x-01 debci[824124]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 22:23:07 ci-worker-s390x-01 debci[3567266]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 01:16:26 ci-worker-s390x-01 debci[40757]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 04:22:34 ci-worker-s390x-01 debci[1275755]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 06:24:45 ci-worker-s390x-01 debci[448310]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 08:26:47 ci-worker-s390x-01 debci[830390]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net

So, for example, we sometimes see errors when publishing results (the
retry: lines), so maybe we are also having problems getting jobs.

Also, it seems the client connects again several times a day, what is
definitively not normal (the amqp connection is supposed to be
persistent). I checked for example one of the arm64 on the Huawei cloud
-- since it's also not inside AWS, but on a different network path --
and it didn't need to reconnect even once in the last full days, having
processed > 500 jobs in that time.

So there is for sure something wrong with the client-server connection
there. Reworking the client for robustness is on my TODO list for a
while.

Attachment: signature.asc
Description: PGP signature


Reply to: