On Tue, Feb 14, 2023 at 09:42:09PM +0100, Paul Gevers wrote: > Hi Phil, > > On 13-02-2023 08:57, Philipp Kern wrote: > > On 12.02.23 22:38, Paul Gevers wrote: > > > I have munin [1], but as said, I'm not a trained sysadmin. I don't > > > know what I'm looking for if you ask "statistics on the network". > > > > This is more of a software development / devops question than a sysadmin > > question, but alas. > > I acknowledge that my reach out was broad and didn't only cover s390x. > > > What I am interested in is *application-level* logging on reconnects. > > Presumably the connection to RabbitMQ is outbound? > > Our configuration can be seen here: > https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb > > > Is it tunneled? Does your application log somewhere when a reconnect > > happens? Does it say when it successfully connected? > > > > I'd expect good software to log something like this: > > > > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"... > > [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345". > > > > And also: > > > > [10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"... > > [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345" > > failed: Connection refused > > @terceiro; I haven't seen these kind of logs on the worker hosts. Do you > know if they exist or if we can generate them? The worker does log it's initial connection, see below. > I think I'm seeing something on the main host. > admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163 > rabbit@ci-master.log | grep -v '\[info\]' | grep -v '\[warning\]' > 2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection > <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671): > 2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection > <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671): > 2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection > <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671): > 2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection > <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671): > 2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection > <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671): > 2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection > <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671): > > and a lot more warnings (220 times in 20 hours) as well; like: > 2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection > <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user: > 'guest'): > > And a lot (around 544) (obviously I don't know if that's only or even > includes the s390x host): > client unexpectedly closed TCP connection root@ci-worker-s390x-01:~# journalctl -u debci-worker@1.service --since='2 days ago' -t debci -b 0 | grep amqp Feb 15 15:10:21 ci-worker-s390x-01 debci[663]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 15 17:58:53 ci-worker-s390x-01 debci[2740543]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 15 19:23:40 ci-worker-s390x-01 debci[1855652]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 15 19:28:12 ci-worker-s390x-01 debci[1939916]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again... Feb 15 20:50:51 ci-worker-s390x-01 debci[783145]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 15 21:36:25 ci-worker-s390x-01 debci[1966510]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again... Feb 15 22:21:43 ci-worker-s390x-01 debci[3243793]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again... Feb 16 00:29:41 ci-worker-s390x-01 debci[4119188]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 01:21:26 ci-worker-s390x-01 debci[2097411]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 03:02:20 ci-worker-s390x-01 debci[1133799]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 06:40:46 ci-worker-s390x-01 debci[953820]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 08:00:24 ci-worker-s390x-01 debci[2875496]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 09:59:09 ci-worker-s390x-01 debci[3864527]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 11:47:09 ci-worker-s390x-01 debci[2310984]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 14:08:01 ci-worker-s390x-01 debci[1968077]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 16:58:24 ci-worker-s390x-01 debci[2496027]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 18:40:23 ci-worker-s390x-01 debci[1074224]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 20:35:15 ci-worker-s390x-01 debci[824124]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 16 22:23:07 ci-worker-s390x-01 debci[3567266]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 17 01:16:26 ci-worker-s390x-01 debci[40757]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 17 04:22:34 ci-worker-s390x-01 debci[1275755]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 17 06:24:45 ci-worker-s390x-01 debci[448310]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net Feb 17 08:26:47 ci-worker-s390x-01 debci[830390]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net So, for example, we sometimes see errors when publishing results (the retry: lines), so maybe we are also having problems getting jobs. Also, it seems the client connects again several times a day, what is definitively not normal (the amqp connection is supposed to be persistent). I checked for example one of the arm64 on the Huawei cloud -- since it's also not inside AWS, but on a different network path -- and it didn't need to reconnect even once in the last full days, having processed > 500 jobs in that time. So there is for sure something wrong with the client-server connection there. Reworking the client for robustness is on my TODO list for a while.
Attachment:
signature.asc
Description: PGP signature