Bug#977904: marked as done (nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot)
Your message dated Thu, 24 Nov 2022 23:38:43 +0100
with message-id <Y3/yc9gQz06xmWaa@eldamar.lan>
and subject line Re: Bug#977904: nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
has caused the Debian Bug report #977904,
regarding nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)
--
977904: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=977904
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
- To: Debian Bug Tracking System <submit@bugs.debian.org>
- Subject: nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
- From: Stephen Dowdy <sdowdy@ucar.edu>
- Date: Tue, 22 Dec 2020 16:39:14 +0000
- Message-id: <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>
Package: nfs-common
Version: 1:1.3.4-2.1+deb9u1
Severity: important
Dear Maintainer,
FYI: /usr/share/bug/nfs-common/script warns of error.
in my case: 'cat /etc/fstab|grep nfs >&3' returns 1 because 'grep'
finds no match (my nfs is all in autofs). should probably '|| true' that
to avoid user confusion in bugreport. same for other grep statements.
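A minimal sketch of the '|| true' guard, assuming a throwaway fstab with no nfs entries. The temporary file and the variable names are made up for illustration, and the real bug script writes to fd 3 rather than /dev/null:

```shell
#!/bin/sh
# Hypothetical stand-in for /etc/fstab on a host whose NFS mounts all live in autofs.
fstab=$(mktemp)
printf '%s\n' 'UUID=0000 / ext4 defaults 0 1' > "$fstab"

# grep exits with status 1 when no line matches, which looks like an error.
unguarded=0
grep nfs "$fstab" >/dev/null || unguarded=$?

# Appending '|| true' turns "no match" into a successful command.
guarded=0
{ grep nfs "$fstab" >/dev/null || true; } || guarded=$?

rm -f "$fstab"
echo "unguarded=$unguarded guarded=$guarded"
```

The guard also hides genuine grep failures (such as an unreadable file), which seems acceptable for a purely informational report script.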
Note:
This is a doozy -- a whole tower of fail (some my fault for
implementing an earlier workaround for nfs deadlocks and forgetting
it was there). So i'm not sure what package to report it against, but
the core trigger is in /usr/sbin/start-statd so starting there. Yes,
my system is likely configured in a pretty non-standard way, having
an /etc/nfsmount.conf forcing nfsv3 to avoid a deadlock in earlier
systems coming back to bite me as a deadlock in the future :-(
other subsystems involved:
- systemd
- dbus-daemon
- autofs
ultimately, my goal here is to help establish a robust systemd-capable
coordination in the various parts here to avoid another similar issue
due to these inter-dependencies. I don't know if 'start-statd' being
re-written to take systemd state into account is the correct solution,
but IMHO, systemd/dbus-daemon are utterly fragile in this situation
and extremely difficult to debug (need systemctl to do stuff, but it
won't work, and you can't restart dbus-daemon w/o systemctl, and kill
-TERM on pid 1 doesn't work ...).
Summary:
- system configured for NFSv3 mounts via /etc/nfsmount.conf
note: this was to work around a bug in NFSv4.[012] that
caused deadlocks against NFSv3 servers running Jessie.
i do not recall the bug #
something changed in a recent Stretch patchlevel as this
was working fine up until i patched and rebooted.
- systemd unit rpc-statd.service is disabled
- automount/autofs -> nfs is called triggering start-statd
that makes a 'systemctl start rpc-statd' that takes down
dbus-daemon and never completes.
- regardless of where the blame lies, it is possible that it is wrong to
call 'systemctl' from inside 'start-statd' *if* it's being called
from a systemd unit itself.
If system is configured for NFS v3 mounts via /etc/nfsmount.conf
and systemctl unit 'rpc-statd' is disabled, then the automounter
creates a chain in boot (at least in our system case) that forcibly
tries to run 'systemctl start rpc-statd' via /usr/sbin/start-statd.
This results in systemctl call not completing (i don't know if
it's because systemctl calls can't be nested or called outside normal
startup flow or what), and eventually dbus-daemon stops responding
(so it could be a bug that needs to be transferred there). this locks
up the entire boot process. systemctl calls all time out.
dbus-daemon is sitting in EAGAIN (resource temporarily unavailable)
Additionally, i wasn't able to ssh in (even though systemd had started
sshd) because of 'pam_motd' in /etc/pam.d/sshd calling update-motd,
which also blocked hard and never completed and was uninterruptible.
once i commented 'pam_motd' out, i could ssh in, and <CTRL>C something
hanging on nfs to get a shell. (again, tower of fail)
once in, if i killed the 'systemctl start rpc-statd', the system would
return to responsiveness. (systemctl could again contact dbus-daemon)
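Since killing the hung 'systemctl start rpc-statd' restored responsiveness, one conceivable mitigation is to bound that call with a time limit. The following is only a sketch under that assumption, not the shipped /usr/sbin/start-statd. The helper name start_bounded and the 30-second limit are hypothetical, and 'sleep' stands in for the hung systemctl so the logic can run without systemd:

```shell
#!/bin/sh
# Sketch only; this is not the shipped /usr/sbin/start-statd.
# start_bounded is a hypothetical helper: run a command, but give up after
# a time limit instead of blocking the caller forever.
start_bounded() {
    limit=$1
    shift
    # timeout(1) from GNU coreutils sends SIGTERM when the limit expires
    # and then exits with status 124.
    timeout "$limit" "$@"
}

# Stand-in for a hung 'systemctl start rpc-statd.service': sleep outlives the limit.
rc=0
start_bounded 1 sleep 5 || rc=$?
echo "hung command: rc=$rc"

# In start-statd the call might look like this (untested assumption):
#   start_bounded 30 systemctl start rpc-statd.service
```

Whether mount.nfs copes gracefully when the start attempt is abandoned this way has not been verified; the sketch only shows that the caller regains control.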
systemd-cgls showed:
+-autofs.service
| +-1453 /usr/sbin/automount --pid-file /var/run/autofs.pid
| +-1465 /bin/mount -t nfs -s -o intr,nodev,nosuid ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64
| +-1466 /sbin/mount.nfs ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64 -s -o rw,nodev,nosuid,intr
| +-1467 /bin/sh /usr/sbin/start-statd
| -1470 systemctl start rpc-statd.service
  ^^^^ this hangs dbus-daemon and brings down the whole systemd kingdom.
before it hung, ...
puffin:/etc/default/grub.d# systemctl list-jobs
                                   TYPE  STATE
607 apt-daily.service              start running
462 nfs-config.service             start running
468 apt-daily-upgrade.service      start waiting
460 rpc-statd-notify.service       start waiting
453 rpc-statd.service              start waiting
464 systemd-tmpfiles-clean.service start running
Note: 'ral-local-linux' is our NFS-shared /usr/local. this may have
been triggered early due to 'cron' being started and user '@reboot' jobs
launching.
Note: i have a lot of systemd debug and other captured logs i can
provide if needed.
here's the /etc/nfsmount.conf that was being used prior:
[ NFSMount_Global_Options ]
nfsvers=3
Thanks,
--stephen
-- Package-specific info:
-- rpcinfo --
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 48853 mountd
100005 1 tcp 45675 mountd
100005 2 udp 56398 mountd
100005 2 tcp 58131 mountd
100005 3 udp 49109 mountd
100005 3 tcp 48261 mountd
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100227 3 tcp 2049
100003 3 udp 2049 nfs
100003 4 udp 2049 nfs
100227 3 udp 2049
100021 1 udp 54879 nlockmgr
100021 3 udp 54879 nlockmgr
100021 4 udp 54879 nlockmgr
100021 1 tcp 41063 nlockmgr
100021 3 tcp 41063 nlockmgr
100021 4 tcp 41063 nlockmgr
100007 2 udp 806 ypbind
100007 1 udp 806 ypbind
100007 2 tcp 807 ypbind
100007 1 tcp 807 ypbind
100024 1 udp 58391 status
100024 1 tcp 34239 status
-- /etc/default/nfs-common --
NEED_STATD=
STATDOPTS=
NEED_IDMAPD=yes
NEED_GSSD=
-- /etc/idmapd.conf --
[General]
Verbosity = 0
Pipefs-Directory = /run/rpc_pipefs
[Mapping]
Nobody-User = nobody
Nobody-Group = nogroup
-- /etc/fstab --
-- System Information:
Debian Release: 9.13
APT prefers oldstable-updates
APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 4.9.0-14-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
Versions of packages nfs-common depends on:
ii adduser 3.115
ii init-system-helpers 1.48
ii keyutils 1.5.9-9
ii libc6 2.24-11+deb9u4
ii libcap2 1:2.25-1
ii libcomerr2 1.43.4-2+deb9u2
ii libdevmapper1.02.1 2:1.02.137-2
ii libevent-2.0-5 2.0.21-stable-3
ii libgssapi-krb5-2 1.15-1+deb9u2
ii libk5crypto3 1.15-1+deb9u2
ii libkeyutils1 1.5.9-9
ii libkrb5-3 1.15-1+deb9u2
ii libmount1 2.29.2-1+deb9u1
ii libnfsidmap2 0.25-5.1
ii libtirpc1 0.2.5-1.2+deb9u1
ii libwrap0 7.6.q-26
ii lsb-base 9.20161125
ii rpcbind 0.2.3-0.6
ii ucf 3.0036
Versions of packages nfs-common recommends:
ii python 2.7.13-2
Versions of packages nfs-common suggests:
pn open-iscsi <none>
pn watchdog <none>
Versions of packages nfs-kernel-server depends on:
ii init-system-helpers 1.48
ii keyutils 1.5.9-9
ii libblkid1 2.29.2-1+deb9u1
ii libc6 2.24-11+deb9u4
ii libcap2 1:2.25-1
ii libsqlite3-0 3.16.2-5+deb9u3
ii libtirpc1 0.2.5-1.2+deb9u1
ii libwrap0 7.6.q-26
ii lsb-base 9.20161125
ii netbase 5.4
ii ucf 3.0036
-- Configuration Files:
/etc/default/nfs-common changed [not included]
-- no debconf information
--- End Message ---
--- Begin Message ---
- To: Stephen Dowdy <sdowdy@ucar.edu>, 977904-done@bugs.debian.org
- Subject: Re: Bug#977904: nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
- From: Salvatore Bonaccorso <carnil@debian.org>
- Date: Thu, 24 Nov 2022 23:38:43 +0100
- Message-id: <Y3/yc9gQz06xmWaa@eldamar.lan>
- In-reply-to: <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>
- References: <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>
Hi Stephen,
On Tue, Dec 22, 2020 at 04:39:14PM +0000, Stephen Dowdy wrote:
> Package: nfs-common
> Version: 1:1.3.4-2.1+deb9u1
> Severity: important
>
> Dear Maintainer,
>
> FYI: /usr/share/bug/nfs-common/script warns of error.
> in my case: 'cat /etc/fstab|grep nfs >&3' returns 1 because 'grep'
> finds no match (my nfs is all in autofs). should probably '|| true' that
> to avoid user confusion in bugreport. same for other grep statements.
>
> Note:
> This is a doozy -- a whole tower of fail (some my fault for
> implementing an earlier workaround for nfs deadlocks and forgetting
> it was there). So i'm not sure what package to report it against, but
> the core trigger is in /usr/sbin/start-statd so starting there. Yes,
> my system is likely configured in a pretty non-standard way, having
> an /etc/nfsmount.conf forcing nfsv3 to avoid a deadlock in earlier
> systems coming back to bite me as a deadlock in the future :-(
>
> other subsystems involved:
> - systemd
> - dbus-daemon
> - autofs
>
> ultimately, my goal here is to help establish a robust systemd-capable
> coordination in the various parts here to avoid another similar issue
> due to these inter-dependencies. I don't know if 'start-statd' being
> re-written to take systemd state into account is the correct solution,
> but IMHO, systemd/dbus-daemon are utterly fragile in this situation
> and extremely difficult to debug (need systemctl to do stuff, but it
> won't work, and you can't restart dbus-daemon w/o systemctl, and kill
> -TERM on pid 1 doesn't work ...).
>
> Summary:
> - system configured for NFSv3 mounts via /etc/nfsmount.conf
> note: this was to work around a bug in NFSv4.[012] that
> caused deadlocks against NFSv3 servers running Jessie.
> i do not recall the bug #
> something changed in a recent Stretch patchlevel as this
> was working fine up until i patched and rebooted.
> - systemd unit rpc-statd.service is disabled
> - automount/autofs -> nfs is called triggering start-statd
> that makes a 'systemctl start rpc-statd' that takes down
> dbus-daemon and never completes.
> - regardless of where the blame lies, it is possible that it is wrong to
> call 'systemctl' from inside 'start-statd' *if* it's being called
> from a systemd unit itself.
>
> If system is configured for NFS v3 mounts via /etc/nfsmount.conf
> and systemctl unit 'rpc-statd' is disabled, then the automounter
> creates a chain in boot (at least in our system case) that forcibly
> tries to run 'systemctl start rpc-statd' via /usr/sbin/start-statd.
>
> This results in systemctl call not completing (i don't know if
> it's because systemctl calls can't be nested or called outside normal
> startup flow or what), and eventually dbus-daemon stops responding
> (so it could be a bug that needs to be transferred there). this locks
> up the entire boot process. systemctl calls all time out.
> dbus-daemon is sitting in EAGAIN (resource temporarily unavailable)
>
> Additionally, i wasn't able to ssh in (even though systemd had started
> sshd) because of 'pam_motd' in /etc/pam.d/sshd calling update-motd,
> which also blocked hard and never completed and was uninterruptible.
> once i commented 'pam_motd' out, i could ssh in, and <CTRL>C something
> hanging on nfs to get a shell. (again, tower of fail)
>
> once in, if i killed the 'systemctl start rpc-statd', the system would
> return to responsiveness. (systemctl could again contact dbus-daemon)
>
> systemd-cgls showed:
>
> +-autofs.service
> | +-1453 /usr/sbin/automount --pid-file /var/run/autofs.pid
> | +-1465 /bin/mount -t nfs -s -o intr,nodev,nosuid ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64
> | +-1466 /sbin/mount.nfs ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64 -s -o rw,nodev,nosuid,intr
> | +-1467 /bin/sh /usr/sbin/start-statd
> | -1470 systemctl start rpc-statd.service
>   ^^^^ this hangs dbus-daemon and brings down the whole systemd kingdom.
>
> before it hung, ...
> puffin:/etc/default/grub.d# systemctl list-jobs
>                                    TYPE  STATE
> 607 apt-daily.service              start running
> 462 nfs-config.service             start running
> 468 apt-daily-upgrade.service      start waiting
> 460 rpc-statd-notify.service       start waiting
> 453 rpc-statd.service              start waiting
> 464 systemd-tmpfiles-clean.service start running
>
>
>
> Note: 'ral-local-linux' is our NFS-shared /usr/local. this may have
> been triggered early due to 'cron' being started and user '@reboot' jobs
> launching.
>
> Note: i have a lot of systemd debug and other captured logs i can
> provide if needed.
>
> here's the /etc/nfsmount.conf that was being used prior:
> [ NFSMount_Global_Options ]
> nfsvers=3
As a matter of BTS housekeeping, I'm closing this older bug report,
on the assumption that the problem no longer occurs with a recent
supported version.
But by all means, if you still can reproduce the problem with newer
nfs-utils versions in stable or above, please do reopen the bug.
Regards,
Salvatore
--- End Message ---