Bug#1093734: nfs-kernel-server: fails to complete setup during upgrade (stuck while restarting nfs-kernel-server.service)
Hi Francesco,
On Wed, Jan 22, 2025 at 09:56:26AM +0100, Salvatore Bonaccorso wrote:
> Control: tags -1 + unreproducible moreinfo
>
> On Wed, Jan 22, 2025 at 12:29:12AM +0100, Francesco Poli (wintermute) wrote:
> > Package: nfs-kernel-server
> > Version: 1:2.8.2-1+b1
> > Severity: grave
> > Justification: causes non-serious data loss
> > X-Debbugs-Cc: invernomuto@paranoici.org
> >
> >
> > Dear maintainers,
> > I encountered a big issue while upgrading package 'nfs-kernel-server'
> > on the box where the NFS server runs (the clients run on the compute
> > nodes of an HPC cluster).
> >
> > The upgrade:
> >
> > [UPGRADE] nfs-kernel-server:amd64 1:2.8.2-1 -> 1:2.8.2-1+b1
> >
> > got stuck at
> >
> > [...]
> > Setting up nfs-kernel-server (1:2.8.2-1+b1) ...
> >
> >
> >
> > It looks like it was stuck at the restart of the systemd service:
> >
> > # systemctl status nfs-kernel-server.service
> > ● nfs-server.service - NFS server and services
> >      Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; prese>
> >     Drop-In: /run/systemd/generator/nfs-server.service.d
> >              └─order-with-mounts.conf
> >      Active: activating (start-pre) since Tue 2025-01-21 12:40:52 CET; 10min ago
> >         Job: 97667
> >  Invocation: ced460d410fe4059b9e8781b35340d70
> >        Docs: man:rpc.nfsd(8)
> >              man:exportfs(8)
> >   Cntrl PID: 249039 (exportfs)
> >       Tasks: 3 (limit: 154102)
> >      Memory: 680K (peak: 2.5M)
> >         CPU: 10ms
> >      CGroup: /system.slice/nfs-server.service
> >              ├─239857 /usr/sbin/nfsdctl threads 0
> >              ├─239918 /usr/sbin/exportfs -au
> >              └─249039 /usr/sbin/exportfs -r
> >
> > There was a 'nfsdctl' process in uninterruptible sleep (D):
> >
> > $ ps -eldaf | grep nf[s]
> > 4 D root 239857 1 0 80 0 - 847 - 12:07 ? 00:00:00 /usr/sbin/nfsdctl threads 0
> > 5 S root 247511 1 0 80 0 - 1375 - 12:35 ? 00:00:00 /usr/sbin/nfsdcld
> >
> > After about 30 min, since trying to kill PID 239857 obviously had no effect,
> > and I could not find any other strategy to restart nfs-kernel-server.service,
> > I had to reboot the box, thus causing many problems to all the NFS clients.
> >
> > After reboot, I could issue:
> >
> > # aptitude --purge-unused safe-upgrade
> >
> > which finally completed the upgrade (fixing the nfs-kernel-server package,
> > which was left in a partially configured state).
> >
> >
> > I have never seen anything like this before, and I have upgraded
> > nfs-kernel-server and related packages on Debian machines for quite
> > a long time.
> > Anyway, this should *not* happen during a system upgrade with
> > aptitude or apt!
> >
> > I don't know whether bug [#992661] is related or not.
> >
> > [#992661]: <https://bugs.debian.org/992661>
> >
> > By looking at /var/log/kern.log , I see that a kernel BUG was traced
> > at the time when the 'nfsdctl' process got stuck in D state.
> > See the attached kern.log snippet.
> >
> > Please investigate and fix the issue as soon as possible.
> > I really hope we can prevent this from happening again!
> >
> > Thanks for your time and dedication.
>
> So I'm not able to reproduce this on a current Debian unstable system
> mimicking the upgrade. *But* it is possible we have some races
> somewhere, as recently discussed at our regular kernel team meeting.
>
> We need first to find a way to trigger the issue in any case.
Upstream has an idea of what the problem is and has posted a patch:
https://lore.kernel.org/linux-nfs/20250125-kdevops-v1-1-a76cf79127b8@kernel.org/
Regards,
Salvatore