Bug#977904: marked as done (nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot)

To: Salvatore Bonaccorso <carnil@debian.org>
Subject: Bug#977904: marked as done (nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot)
From: "Debian Bug Tracking System" <owner@bugs.debian.org>
Date: Thu, 24 Nov 2022 22:42:03 +0000
Message-id: <handler.977904.D977904.16693295283255151.ackdone@bugs.debian.org>
Reply-to: 977904@bugs.debian.org
References: <Y3/yc9gQz06xmWaa@eldamar.lan> <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>

Your message dated Thu, 24 Nov 2022 23:38:43 +0100
with message-id <Y3/yc9gQz06xmWaa@eldamar.lan>
and subject line Re: Bug#977904: nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
has caused the Debian Bug report #977904,
regarding nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
977904: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=977904
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems

--- Begin Message ---

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
From: Stephen Dowdy <sdowdy@ucar.edu>
Date: Tue, 22 Dec 2020 16:39:14 +0000
Message-id: <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>

Package: nfs-common
Version: 1:1.3.4-2.1+deb9u1
Severity: important

Dear Maintainer,

FYI: /usr/share/bug/nfs-common/script warns of an error.
  in my case:  'cat /etc/fstab|grep nfs >&3' returns 1 because 'grep'
  finds no match (my nfs is all in autofs).  should probably '|| true'
  that to avoid user confusion in the bug report.  same for the other
  grep statements.
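
A minimal sketch of the '|| true' idea. The helper name report_nfs_fstab is
hypothetical and the real /usr/share/bug/nfs-common/script is not reproduced
here; this only assumes the script wants any matching fstab lines on its
output:

```shell
#!/bin/sh
# Hypothetical helper, not the actual bug script: print fstab lines that
# mention nfs, but never fail when there are none.
report_nfs_fstab() {
    fstab="${1:-/etc/fstab}"
    # grep exits 1 when nothing matches; '|| true' turns that into success,
    # so a host whose NFS mounts all live in autofs does not trigger a warning.
    grep nfs "$fstab" || true
}
```

Calling 'report_nfs_fstab /etc/fstab' on a host with no NFS entries in fstab
prints nothing and still exits 0.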

Note:
  This is a doozy -- a whole tower of fail (some my fault for
  implementing an earlier workaround for nfs deadlocks and forgetting
  it was there). So i'm not sure what package to report it against, but
  the core trigger is in /usr/sbin/start-statd so starting there. Yes,
  my system is likely configured in a pretty non-standard way, having
  an /etc/nfsmount.conf forcing nfsv3 to avoid a deadlock in earlier
  systems coming back to bite me as a deadlock in the future :-(

  other subsystems involved:
    - systemd
    - dbus-daemon
    - autofs

  ultimately, my goal here is to help establish robust systemd-capable
  coordination among the various parts here to avoid another similar issue
  due to these inter-dependencies.  I don't know if 'start-statd' being
  re-written to take systemd state into account is the correct solution,
  but IMHO, systemd/dbus-daemon are utterly fragile in this situation
  and extremely difficult to debug (need systemctl to do stuff, but it
  won't work, and you can't restart dbus-daemon w/o systemctl, and kill
  -TERM on pid 1 doesn't work ...)

Summary:
  - system configured for NFSv3 mounts via /etc/nfsmount.conf
    note: this was to work around a bug in NFSv4.[012] that
    caused deadlocks against NFSv3 servers running Jessie.
    i do not recall the bug #
    something changed in a recent Stretch patchlevel, as this
    was working fine up until i patched and rebooted.
  - systemd unit rpc-statd.service is disabled
  - automount/autofs -> nfs is called, triggering start-statd,
    which runs a 'systemctl start rpc-statd' that takes down
    dbus-daemon and never completes.
  - regardless of where the blame lies, it is possibly wrong to
    call 'systemctl' from inside 'start-statd' *if* it's being called
    from a systemd unit itself.

  If the system is configured for NFSv3 mounts via /etc/nfsmount.conf
  and the systemd unit 'rpc-statd' is disabled, then the automounter
  creates a chain at boot (at least in our system's case) that forcibly
  tries to run 'systemctl start rpc-statd' via /usr/sbin/start-statd.
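
  One way the chain above might avoid waiting on a job that cannot finish
  is to queue the start without blocking. This is only a sketch, not the
  actual /usr/sbin/start-statd: the function name and the SYSTEMCTL and
  SYSTEMD_RUNDIR overrides are hypothetical, and whether
  'systemctl --no-block' avoids the hang described here is untested.

```shell
#!/bin/sh
# Hypothetical sketch, not the shipped start-statd script.
# SYSTEMCTL and SYSTEMD_RUNDIR are made-up overrides for dry runs/tests.
request_statd_start() {
    # /run/systemd/system exists only when systemd is the running init.
    if [ -d "${SYSTEMD_RUNDIR:-/run/systemd/system}" ]; then
        # --no-block enqueues the start job and returns immediately
        # instead of waiting for the job to complete.
        "${SYSTEMCTL:-systemctl}" --no-block start rpc-statd.service
    else
        # Not running under systemd: let the caller pick another path.
        return 1
    fi
}
```

  Setting SYSTEMCTL=echo before calling request_statd_start shows the
  command line that would be run without touching systemd.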

  This results in the systemctl call not completing (i don't know if
  it's because systemctl calls can't be nested or called outside the normal
  startup flow or what), and eventually dbus-daemon stops responding
  (so it could be a bug that needs to be transferred there).  this locks
  up the entire boot process.   systemctl calls all time out.
  dbus-daemon is sitting in EAGAIN (resource temporarily unavailable)

  Additionally, i wasn't able to ssh in (even though systemd had started
  sshd) because of 'pam_motd' in /etc/pam.d/sshd calling update-motd,
  which also blocked hard, never completed, and was uninterruptible.
  once i commented 'pam_motd' out, i could ssh in, and <CTRL>C something
  hanging on nfs to get a shell. (again, tower of fail)

  once in, if i killed the 'systemctl start rpc-statd', the system would
  return to responsiveness. (systemctl could again contact dbus-daemon)

  systemd-cgls showed:

  +-autofs.service
    | +-1453 /usr/sbin/automount --pid-file /var/run/autofs.pid
      | +-1465 /bin/mount -t nfs -s -o intr,nodev,nosuid ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64
        | +-1466 /sbin/mount.nfs ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64 -s -o rw,nodev,nosuid,intr
          | +-1467 /bin/sh /usr/sbin/start-statd
            | -1470 systemctl start rpc-statd.service
                    ^^^^ this hangs dbus-daemon and brings down the
                    whole systemd kingdom.

  before it hung, ...
    puffin:/etc/default/grub.d# systemctl list-jobs
    JOB UNIT                           TYPE  STATE
    607 apt-daily.service              start running
    462 nfs-config.service             start running
    468 apt-daily-upgrade.service      start waiting
    460 rpc-statd-notify.service       start waiting
    453 rpc-statd.service              start waiting
    464 systemd-tmpfiles-clean.service start running



Note: 'ral-local-linux' is our NFS-shared /usr/local.  this may have
been triggered early due to 'cron' being started and user '@reboot' jobs
launching.

Note: i have a lot of systemd debug and other captured logs i can
provide if needed.

here's the /etc/nfsmount.conf that was being used prior:
	[ NFSMount_Global_Options ]
	     nfsvers=3
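
To check quickly whether a host has such a global override in place,
something like the following could be used. It is a rough sketch: the
function name is made up, and it only handles the simple 'key=value' form
shown above, not the full nfsmount.conf syntax.

```shell
#!/bin/sh
# Hypothetical helper: print the nfsvers value from the
# [ NFSMount_Global_Options ] section of an nfsmount.conf-style file.
nfs_global_vers() {
    awk '
        # A new section header: remember whether it is the global one.
        /^[ \t]*\[/ { in_global = ($0 ~ /NFSMount_Global_Options/); next }
        # Inside the global section, report the first nfsvers setting.
        in_global && /^[ \t]*[Nn]fsvers[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # drop the key and "="
            sub(/[ \t]*$/, "")         # drop trailing whitespace
            print
            exit
        }
    ' "${1:-/etc/nfsmount.conf}"
}
```

Run against the file above, 'nfs_global_vers /etc/nfsmount.conf' prints 3;
it prints nothing when no global nfsvers setting is present.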

Thanks,
--stephen

-- Package-specific info:
-- rpcinfo --
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100005    1   udp  48853  mountd
    100005    1   tcp  45675  mountd
    100005    2   udp  56398  mountd
    100005    2   tcp  58131  mountd
    100005    3   udp  49109  mountd
    100005    3   tcp  48261  mountd
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100227    3   tcp   2049
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100227    3   udp   2049
    100021    1   udp  54879  nlockmgr
    100021    3   udp  54879  nlockmgr
    100021    4   udp  54879  nlockmgr
    100021    1   tcp  41063  nlockmgr
    100021    3   tcp  41063  nlockmgr
    100021    4   tcp  41063  nlockmgr
    100007    2   udp    806  ypbind
    100007    1   udp    806  ypbind
    100007    2   tcp    807  ypbind
    100007    1   tcp    807  ypbind
    100024    1   udp  58391  status
    100024    1   tcp  34239  status
-- /etc/default/nfs-common --
NEED_STATD=
STATDOPTS=
NEED_IDMAPD=yes
NEED_GSSD=
-- /etc/idmapd.conf --
[General]
Verbosity = 0
Pipefs-Directory = /run/rpc_pipefs
[Mapping]
Nobody-User = nobody
Nobody-Group = nogroup
-- /etc/fstab --

-- System Information:
Debian Release: 9.13
  APT prefers oldstable-updates
  APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.9.0-14-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages nfs-common depends on:
ii  adduser              3.115
ii  init-system-helpers  1.48
ii  keyutils             1.5.9-9
ii  libc6                2.24-11+deb9u4
ii  libcap2              1:2.25-1
ii  libcomerr2           1.43.4-2+deb9u2
ii  libdevmapper1.02.1   2:1.02.137-2
ii  libevent-2.0-5       2.0.21-stable-3
ii  libgssapi-krb5-2     1.15-1+deb9u2
ii  libk5crypto3         1.15-1+deb9u2
ii  libkeyutils1         1.5.9-9
ii  libkrb5-3            1.15-1+deb9u2
ii  libmount1            2.29.2-1+deb9u1
ii  libnfsidmap2         0.25-5.1
ii  libtirpc1            0.2.5-1.2+deb9u1
ii  libwrap0             7.6.q-26
ii  lsb-base             9.20161125
ii  rpcbind              0.2.3-0.6
ii  ucf                  3.0036

Versions of packages nfs-common recommends:
ii  python  2.7.13-2

Versions of packages nfs-common suggests:
pn  open-iscsi  <none>
pn  watchdog    <none>

Versions of packages nfs-kernel-server depends on:
ii  init-system-helpers  1.48
ii  keyutils             1.5.9-9
ii  libblkid1            2.29.2-1+deb9u1
ii  libc6                2.24-11+deb9u4
ii  libcap2              1:2.25-1
ii  libsqlite3-0         3.16.2-5+deb9u3
ii  libtirpc1            0.2.5-1.2+deb9u1
ii  libwrap0             7.6.q-26
ii  lsb-base             9.20161125
ii  netbase              5.4
ii  ucf                  3.0036

-- Configuration Files:
/etc/default/nfs-common changed [not included]

-- no debconf information

--- End Message ---

--- Begin Message ---

To: Stephen Dowdy <sdowdy@ucar.edu>, 977904-done@bugs.debian.org
Subject: Re: Bug#977904: nfs-common: start-statd calls 'systemctl start rpc-statd' which can hang dbus-daemon/systemd at boot
From: Salvatore Bonaccorso <carnil@debian.org>
Date: Thu, 24 Nov 2022 23:38:43 +0100
Message-id: <Y3/yc9gQz06xmWaa@eldamar.lan>
In-reply-to: <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>
References: <160865515450.14133.9497230683099737632.reportbug@puffin.rap.ucar.edu>

Hi Stephen,

On Tue, Dec 22, 2020 at 04:39:14PM +0000, Stephen Dowdy wrote:
> Package: nfs-common
> Version: 1:1.3.4-2.1+deb9u1
> Severity: important
> 
> Dear Maintainer,
> 
> FYI: /usr/share/bug/nfs-common/script warns of an error.
>   in my case:  'cat /etc/fstab|grep nfs >&3' returns 1 because 'grep'
>   finds no match (my nfs is all in autofs).  should probably '|| true'
>   that to avoid user confusion in the bug report.  same for the other
>   grep statements.
> 
> Note:
>   This is a doozy -- a whole tower of fail (some my fault for
>   implementing an earlier workaround for nfs deadlocks and forgetting
>   it was there). So i'm not sure what package to report it against, but
>   the core trigger is in /usr/sbin/start-statd so starting there. Yes,
>   my system is likely configured in a pretty non-standard way, having
>   an /etc/nfsmount.conf forcing nfsv3 to avoid a deadlock in earlier
>   systems coming back to bite me as a deadlock in the future :-(
> 
>   other subsystems involved:
>     - systemd
>     - dbus-daemon
>     - autofs
> 
>   ultimately, my goal here is to help establish robust systemd-capable
>   coordination among the various parts here to avoid another similar issue
>   due to these inter-dependencies.  I don't know if 'start-statd' being
>   re-written to take systemd state into account is the correct solution,
>   but IMHO, systemd/dbus-daemon are utterly fragile in this situation
>   and extremely difficult to debug (need systemctl to do stuff, but it
>   won't work, and you can't restart dbus-daemon w/o systemctl, and kill
>   -TERM on pid 1 doesn't work ...)
> 
> Summary:
>   - system configured for NFSv3 mounts via /etc/nfsmount.conf
>     note: this was to work around a bug in NFSv4.[012] that
>     caused deadlocks against NFSv3 servers running Jessie.
>     i do not recall the bug #
>     something changed in a recent Stretch patchlevel, as this
>     was working fine up until i patched and rebooted.
>   - systemd unit rpc-statd.service is disabled
>   - automount/autofs -> nfs is called, triggering start-statd,
>     which runs a 'systemctl start rpc-statd' that takes down
>     dbus-daemon and never completes.
>   - regardless of where the blame lies, it is possibly wrong to
>     call 'systemctl' from inside 'start-statd' *if* it's being called
>     from a systemd unit itself.
> 
>   If the system is configured for NFSv3 mounts via /etc/nfsmount.conf
>   and the systemd unit 'rpc-statd' is disabled, then the automounter
>   creates a chain at boot (at least in our system's case) that forcibly
>   tries to run 'systemctl start rpc-statd' via /usr/sbin/start-statd.
> 
>   This results in the systemctl call not completing (i don't know if
>   it's because systemctl calls can't be nested or called outside the normal
>   startup flow or what), and eventually dbus-daemon stops responding
>   (so it could be a bug that needs to be transferred there).  this locks
>   up the entire boot process.   systemctl calls all time out.
>   dbus-daemon is sitting in EAGAIN (resource temporarily unavailable)
> 
>   Additionally, i wasn't able to ssh in (even though systemd had started
>   sshd) because of 'pam_motd' in /etc/pam.d/sshd calling update-motd,
>   which also blocked hard, never completed, and was uninterruptible.
>   once i commented 'pam_motd' out, i could ssh in, and <CTRL>C something
>   hanging on nfs to get a shell. (again, tower of fail)
> 
>   once in, if i killed the 'systemctl start rpc-statd', the system would
>   return to responsiveness. (systemctl could again contact dbus-daemon)
> 
>   systemd-cgls showed:
> 
>   +-autofs.service
>     | +-1453 /usr/sbin/automount --pid-file /var/run/autofs.pid
>       | +-1465 /bin/mount -t nfs -s -o intr,nodev,nosuid ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64
>         | +-1466 /sbin/mount.nfs ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64 -s -o rw,nodev,nosuid,intr
>           | +-1467 /bin/sh /usr/sbin/start-statd
>             | -1470 systemctl start rpc-statd.service
>                     ^^^^ this hangs dbus-daemon and brings down the
>                     whole systemd kingdom.
> 
>   before it hung, ...
>     puffin:/etc/default/grub.d# systemctl list-jobs
>     JOB UNIT                           TYPE  STATE
>     607 apt-daily.service              start running
>     462 nfs-config.service             start running
>     468 apt-daily-upgrade.service      start waiting
>     460 rpc-statd-notify.service       start waiting
>     453 rpc-statd.service              start waiting
>     464 systemd-tmpfiles-clean.service start running
> 
> 
> 
> Note: 'ral-local-linux' is our NFS-shared /usr/local.  this may have
> been triggered early due to 'cron' being started and user '@reboot' jobs
> launching.
> 
> Note: i have a lot of systemd debug and other captured logs i can
> provide if needed.
> 
> here's the /etc/nfsmount.conf that was being used prior:
> 	[ NFSMount_Global_Options ]
> 	     nfsvers=3

As part of BTS housekeeping I'm closing this older bug report,
assuming the problem no longer occurs with a recent supported version.
But by all means, if you can still reproduce the problem with newer
nfs-utils versions in stable or later, please do reopen the bug.

Regards,
Salvatore

--- End Message ---
