Dear NFS developers,
I am running multiple pods on a Kubernetes node; they all mount
different NFS shares from the same NFS server. I started to notice
hangs in my dovecot process after I switched from the upstream 5.15
kernel to Debian's kernel. You can find the Debian bug report at
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501.
So, effectively, I am running dovecot in Kubernetes, and dovecot's data
directory is accessed over NFS. Eventually one dovecot process gets
stuck in nfs4_lookup_revalidate(). From that point on, that process
cannot be killed; however, other processes can access NFS as normal.
Another dovecot process running on the very same node and accessing the
same NFS share works too.
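For anyone who wants to check for the same symptom, here is a minimal
sketch that lists processes in uninterruptible sleep (state D) together
with the kernel function they are waiting in. It assumes a Linux /proc
filesystem; the helper names are made up for this example, and
/proc/<pid>/stack is usually readable only by root. A process hung as
described above would be expected to show nfs4_lookup_revalidate in its
stack.

```python
import os


def parse_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def read_text(path):
    """Read a small /proc file, returning '' if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def find_blocked(proc_root="/proc"):
    """Return a list of (pid, comm, wchan, stack) for processes in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []
    blocked = []
    for entry in entries:
        if not entry.isdigit():
            continue
        stat_text = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat_text:
            continue  # process exited while we were scanning
        comm, state = parse_state(stat_text)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            stack = read_text(os.path.join(proc_root, entry, "stack"))
            blocked.append((int(entry), comm, wchan, stack))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan, stack in find_blocked():
        print(f"{pid} {comm} wchan={wchan or '?'}")
        if stack:
            print(stack)
```

A process that shows up in state D only briefly is normal; the
interesting case is one that stays there across repeated runs.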
I am still in the process of bisecting; however, I cannot reliably
trigger the bug. Originally it took a few days before I noticed a
hanging process. Now I am trying to mimic the file operations that
dovecot does, but at a faster rate. This seems to trigger the bug in a
few hours; however, I can still make mistakes during the bisect.
I've scheduled many of my applications which use NFS shares to the same
node, to have more NFS load on that node.
I am attaching my simple app which triggers the bug in a few hours, at
least in my lab. I have two dedicated NFS shares for this test case,
and I am running 3 instances of the application for both shares. Also,
I am running other production applications on the same node which also
use NFS; however, I don't experience lockups with them. They are
librenms, prometheus, and a private docker registry. Because of this, I
don't know if running only the attached app is enough to trigger the bug.
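The attachment itself is not reproduced inline. Purely as an
illustration of the kind of load meant here, the following is a rough
sketch of maildir-style file churn (write to tmp, rename to new, stat,
rename to cur, unlink). It is not the attached app: the function names,
the file naming scheme, and the 4 KiB payload are assumptions made for
this example, and nothing here is known to trigger the bug.

```python
import os
import time


def churn_once(base, n):
    """Run one maildir-style delivery cycle under `base`.

    Returns the last path used (already unlinked when this returns).
    """
    tmp_dir = os.path.join(base, "tmp")
    new_dir = os.path.join(base, "new")
    cur_dir = os.path.join(base, "cur")
    for d in (tmp_dir, new_dir, cur_dir):
        os.makedirs(d, exist_ok=True)

    # Hypothetical unique name, loosely modelled on maildir file names.
    name = f"{int(time.time())}.{os.getpid()}_{n}.repro"

    # Deliver: write into tmp/ and flush to the server before renaming.
    tmp_path = os.path.join(tmp_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(b"x" * 4096)
        f.flush()
        os.fsync(f.fileno())

    # Publish: rename tmp/ -> new/, then look the entry up again.
    new_path = os.path.join(new_dir, name)
    os.rename(tmp_path, new_path)
    os.listdir(new_dir)
    os.stat(new_path)

    # Mark as seen: rename new/ -> cur/ with a flag suffix, then delete.
    cur_path = os.path.join(cur_dir, name + ":2,S")
    os.rename(new_path, cur_path)
    os.stat(cur_path)
    os.unlink(cur_path)
    return cur_path


def run(base, iterations):
    """Repeat the delivery cycle `iterations` times under `base`."""
    for n in range(iterations):
        churn_once(base, n)
```

To use it, call run() with a directory on the NFS mount, for example
run("/mnt/nfs-test/maildir", 100000), where the path is a placeholder.
Several instances can be started in parallel against the same share.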
Once I have a suspect commit from my bisecting process, I will
report it here.
My NFS server is a TrueNAS, based on FreeBSD 13.3.
Thanks in advance,
Richard