
Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate



Dear devs,

I am attaching a stripped-down version of the small program, which triggers the bug very quickly: within a few minutes in my test lab. It turns out that a single NFS mountpoint is enough. Just start the program with the NFS mount as its first argument. It will chdir there and perform file operations, which trigger a lockup within a few minutes.
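For illustration only, the general shape of such a program can be sketched as below. This is not the attached program, whose contents are not reproduced here. The function names (churn, main), the file names, and the particular mix of operations (create, rename, stat, unlink) are assumptions chosen to resemble maildir-style activity, and nothing here is known to trigger the bug.

```python
import os
import sys


def churn(mount_dir, iterations=1000):
    """chdir into mount_dir and repeat create/rename/stat/unlink cycles.

    Hypothetical workload: the operation mix is an assumption, not the
    one used by the attached reproducer.
    """
    os.chdir(mount_dir)
    done = 0
    for i in range(iterations):
        tmp = "tmp.%d.%d" % (os.getpid(), i)
        new = "new.%d.%d" % (os.getpid(), i)
        with open(tmp, "w") as f:   # create a small file
            f.write("x\n")
        os.rename(tmp, new)         # rename it within the same directory
        os.stat(new)                # look the new name up again
        os.unlink(new)              # and remove it
        done += 1
    return done


def main(argv):
    """Take the NFS mount as the first argument, as described above."""
    if len(argv) != 2:
        print("usage: %s <nfs-mount>" % argv[0], file=sys.stderr)
        return 2
    churn(argv[1])
    return 0
```

To use it as a script, call main(sys.argv) at the bottom of the file and pass the mount point on the command line. Several instances can be run in parallel against the same mount, because the file names include the PID.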

Please take a look at it.

Thanks in advance,
Richard

On 2024-05-23 14:12, Richard Kojedzinszky wrote:
Dear devs,

Bisecting has now shown that 3c59366c207e4c6c6569524af606baf017a55c61 is the first bad commit for me. Strangely, it only affects my dovecot process accessing data over NFS.

Can you please confirm that this may be a bad commit?

The programs I attached earlier can be used to demonstrate and trigger the issue. They could probably be stripped down further, to the minimal set of operations that triggers the bug.

Thanks in advance,
Richard


On 2024-05-23 09:10, Richard Kojedzinszky wrote:
Dear NFS developers,

I am running multiple pods on a Kubernetes node; they all mount different NFS shares from the same NFS server. I started to notice hangs in my dovecot process after I switched to Debian's kernel from upstream 5.15. You can find the Debian bug report at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501.

So, effectively, I am running dovecot in Kubernetes, and dovecot's data directory is accessed over NFS. Eventually one dovecot process gets stuck in nfs4_lookup_revalidate(). From that point on, that process cannot be killed; however, other processes can access NFS as normal. Another dovecot process running on the very same node and accessing the same NFS share also keeps working.
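A hang like this can be spotted from user space with a small script such as the sketch below. It assumes the stuck process is in uninterruptible sleep (state D), which this report does not state explicitly. It reads /proc/<pid>/stat and /proc/<pid>/wchan; the helper names are made up for this example. The wchan value is only the kernel symbol the task is sleeping in, so it will not necessarily read nfs4_lookup_revalidate; the full kernel stack is in /proc/<pid>/stack, which normally requires root.

```python
import os


def parse_stat_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return comm, state


def uninterruptible_tasks(proc="/proc"):
    """List (pid, comm, wchan) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"
        found.append((int(entry), comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in uninterruptible_tasks():
        print(pid, comm, wchan)
```

A process that shows up in state D on every run, minutes apart, is a candidate for the hang described above. Short-lived D states during normal I/O are expected.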

I am still in the process of bisecting; however, I cannot reliably trigger the bug. Originally it took a few days before I noticed a hanging process. I am now trying to mimic dovecot's file operations at a faster rate. This seems to trigger the bug within a few hours; however, I can still make mistakes during the bisect.

I've scheduled many of my applications which use NFS shares to the same node, to have more NFS load on that node.

I am attaching my simple app, which triggers the bug in a few hours, at least in my lab. I have two dedicated NFS shares for this test case, and I am running 3 instances of the application for each share. I am also running other production applications on the same node which use NFS; however, I don't experience lockups with them. They are librenms, prometheus, and a private Docker registry. Because of this, I don't know whether running the attached app alone is enough to trigger the bug.

Once my bisecting points to a suspect commit, I will report it here.

My NFS server is a TrueNAS, based on FreeBSD 13.3.

Thanks in advance,
Richard

Attachment: ds.tar
Description: Unix tar archive

