[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bizzare bug: Systemd-logind session creation and prometheus-node-exporter systemd collector fail with message "Failed to activate service 'org.freedesktop.systemd1': timed out "



Hi all, thanks for your time! Please let me know if this isn't the right spot for this, or if I should file a bug.

I'm on debian buster, package versions for ones in question are:
systemd - 241-7~deb10u8
prometheus-node-exporter - 0.17.0+ds-3+b11
dbus - 1.12.20-0+deb10u1
pacemaker - 2.0.1-5+deb10u1

This has been a bit of a head scratcher for me and some others for some time now. Quick environmental description; I have two-node clusters running buster, 
supporting pacemaker with some systemd and other typed resources, and exporting metrics via prometheus-node-exporter. 
There is also a service that once a minute, from the "active" node in the 2 node cluster, logs into the other node via rsync over ssh.

Bizarrely, with enough uptime systemd-logind, prometheus-node-exporter, and pacemaker, will all start to log errors of the type: 
                                      "Failed to activate service 'org.freedesktop.systemd1'". 
This of course causes pacemaker to snipe the "failed" node since the resources there can no longer pass the status checks.

Uptime varies wildly between reproductions. Currently we've observed 5, with uptimes of 45,48,62,81,100 days respectively. The only difference in those incidents 
is that the machine with an uptime of 48 days before it was hit was not running any of the pacemaker resources. I'm going to include some of the logs from this machine
 in particular as it seems to be the most "pure" reproduction of the issue, and is the only one that survived pacemaker's ire.

The first incidence of the error from systemd-logind:
Jul 31 19:50:50 nosferatu systemd-logind[4042]: New session 125887 of user rsync-user.
Jul 31 19:50:50 nosferatu systemd-logind[4042]: Session 125887 logged out. Waiting for processes to exit.
Jul 31 19:50:50 nosferatu systemd-logind[4042]: Removed session 125887.
Jul 31 19:51:50 nosferatu systemd-logind[4042]: New session 125889 of user rsync-user.
Jul 31 19:51:51 nosferatu systemd-logind[4042]: Session 125889 logged out. Waiting for processes to exit.
Jul 31 19:51:51 nosferatu systemd-logind[4042]: Removed session 125889.
Jul 31 19:52:52 nosferatu systemd-logind[4042]: Failed to start user service 'user@999.service', ignoring: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:53:17 nosferatu systemd-logind[4042]: Failed to start session scope session-125891.scope: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:53:42 nosferatu systemd-logind[4042]: Failed to stop user service 'user@999.service', ignoring: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:54:07 nosferatu systemd-logind[4042]: Failed to start user service 'user@999.service', ignoring: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:54:32 nosferatu systemd-logind[4042]: Failed to start session scope session-125892.scope: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:54:57 nosferatu systemd-logind[4042]: Failed to stop user service 'user@999.service', ignoring: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
 
Dbus-daemon logs a similar error:
Jun 20 12:45:01 nosferatu systemd[1]: Started D-Bus System Message Bus.
Jul 31 19:52:27 nosferatu dbus-daemon[4047]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)

Prometheus-node-exporter reports the same:
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]: time="2022-06-20T09:45:01-07:00" level=info msg=" - vmstat" source="node_exporter.go:104"
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]: time="2022-06-20T09:45:01-07:00" level=info msg=" - xfs" source="node_exporter.go:104"
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]: time="2022-06-20T09:45:01-07:00" level=info msg=" - zfs" source="node_exporter.go:104"
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]: time="2022-06-20T09:45:01-07:00" level=info msg="Listening on :9100" source="node_exporter.go:170"
Jul 31 19:52:27 nosferatu prometheus-node-exporter[4025]: time="2022-07-31T16:52:27-07:00" level=error msg="ERROR: systemd collector failed after 0.351864s: couldn't get units: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)" source="collector.go:132"

From the logs, it appears that prometheus-node-exporter is the first one to hit the problem, and then is joined in a degraded waltz by systemd-logind until 1 second after midnight, when;
Aug 01 00:00:01 nosferatu dbus-daemon[4047]: [system] Successfully activated service 'org.freedesktop.systemd1'

Then after this, systemd-logind and prometheus-node-exporter stop failing with the timeout message, but something interesting happens; systemd-logind loses the ability to fully close 
the new sessions being opened, and eventually the once-a-minute login process causes the node to max out on user sessions. 
You can sort of see this in the logs, since no longer are sessions removed after the "Waiting for processes to exit message";
Aug 07 13:48:50 nosferatu systemd-logind[4042]: New session 136763 of user rsync-user.
Aug 07 13:48:50 nosferatu systemd-logind[4042]: Session 136763 logged out. Waiting for processes to exit.
Aug 07 13:49:50 nosferatu systemd-logind[4042]: New session 136764 of user rsync-user.
Aug 07 13:49:50 nosferatu systemd-logind[4042]: Session 136764 logged out. Waiting for processes to exit. <-8192 total user sessions active here
Aug 21 21:12:51 nosferatu systemd-logind[4042]: Power key pressed. <- When I rebooted the machine over IPMI so I could log in :(
Aug 21 21:12:51 nosferatu systemd-logind[4042]: Powering Off...

I know from other instances that performing `start-stop-daemon --stop --quiet --pidfile <PIDOFASERVICE>` causes systemd to reload with the same message in dbus of 
"Successfully activated service"..., so I imagine this resurrection occurred due to some nightly service, though what exactly I cannot tell, since logs had been rotated before I started to look at the system. 
 
Of additional though maybe spurious interest is the fact that login sessions increment oddly; scanning the logs, the session number goes up by 2 for every new connection, 
except on the 10th connection or so, where it goes up by 3.

I have tried checking for session leaks with loginctl, bus connection leaks with busctl, confirmed that the OOM killer wasn't active, confirmed that CPU util was within reason,
 and most other common-sense checks. The systems are otherwise healthy, and so far this hasn't happened on the same machine twice. 

As you may imagine, I'm thoroughly flummoxed by this issue, and insofar I've been unable to reproduce it by doing things like increasing the interval of the prometheus collector or the rsync service.  Normally, I'd just update
and presume that'd solve the issue, but without knowledge of why this issue is occuring I'm hesitant to call anything a fix.

Does this ring any bells to anyone? Any known bugs I'm hitting here with some subcomponent? Does anyone have any advice for useful debug steps that could get me closer to a solution? 
I've also got many, many, more logs available if there's anything that may be of use.

Cheers, and thanks ever so much for your time and attention!
Christian



Reply to: