Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

To: linux-nfs@vger.kernel.org, 1071501@bugs.debian.org
Subject: Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate
From: Richard Kojedzinszky <richard+debian+bugreport@kojedz.in>
Date: Thu, 23 May 2024 14:12:02 +0200
Message-id: <73e081764d06746be27c5f0d2f181938@kojedz.in>
Reply-to: Richard Kojedzinszky <richard+debian+bugreport@kojedz.in>, 1071501@bugs.debian.org
In-reply-to: <0473c552b6fd8e96ef2ffbf0435a7552@kojedz.in>
References: <0473c552b6fd8e96ef2ffbf0435a7552@kojedz.in> <171619724421.12490.10588035153055943112.reportbug@reportbug-6bf8b7fbdc-jccqf>

Dear devs,

Bisecting has now shown that 3c59366c207e4c6c6569524af606baf017a55c61 is the bad commit for me. Strangely, it only affects my dovecot process accessing data over NFS.


Can you please confirm that this may be a bad commit?

The programs I attached earlier may be used to demonstrate/trigger the issue. They could probably even be stripped down to the minimal operations needed to trigger the bug.


Thanks in advance,
Richard


On 2024-05-23 09:10, Richard Kojedzinszky wrote:

Dear NFS developers,
I am running multiple pods on a Kubernetes node; they all mount different NFS shares from the same NFS server. I started to notice hangs in my dovecot process after I switched from upstream 5.15 to Debian's kernel. You can find the Debian bug report at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501.
So, effectively, I am running dovecot in Kubernetes, and dovecot's data directory is accessed over NFS. Eventually one dovecot process gets stuck in nfs4_lookup_revalidate(). From that point on, that process cannot be killed; however, other processes can access NFS as normal. Another dovecot process running on the very same node and accessing the same NFS share also keeps working.
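For anyone trying to spot such a hung process, here is a minimal sketch in Python. It rests on assumptions: that the stuck task sits in uninterruptible sleep (state D), which has not been verified for this bug, and that /proc/<pid>/wchan is readable. The helper names parse_stat_state and find_blocked are made up for illustration. wchan may name a lower-level wait function rather than nfs4_lookup_revalidate itself; /proc/<pid>/stack (root only) shows the full kernel stack.

```python
import os

def parse_stat_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    comm = stat_line[start + 1:end]
    state = stat_line[end + 1:].split()[0]
    return comm, state

def find_blocked(wchan_substring=""):
    """List (pid, comm, wchan) for tasks in state D whose wchan matches."""
    hits = []
    if not os.path.isdir("/proc"):
        return hits  # not a Linux-style procfs; nothing to scan
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                comm, state = parse_stat_state(f.read())
            with open(f"/proc/{entry}/wchan") as f:
                wchan = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited, access denied, or unparsable stat
        if state == "D" and wchan_substring in wchan:
            hits.append((int(entry), comm, wchan))
    return hits
```

Calling find_blocked() lists every D-state task; find_blocked("nfs") narrows the list to tasks whose wait channel mentions nfs.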
I am still in the process of bisecting; however, I cannot reliably trigger the bug. Originally it took a few days before I noticed a hanging process. Now I am trying to mimic the file operations dovecot performs, only faster. This now seems to trigger the bug within a few hours; however, I can still make mistakes during the bisect.
I've scheduled many of my applications that use NFS shares onto the same node, to put more NFS load on that node.
I am attaching my simple app, which triggers the bug in a few hours, at least in my lab. I have two dedicated NFS shares for this test case, and I am running 3 instances of the application for each share. I am also running other production applications on the same node that use NFS; however, I don't experience lockups with them. They are librenms, prometheus, and a private docker registry. Because of this, I don't know whether running the attached app alone is enough to trigger the bug.
Once I have a suspect commit from my bisecting, I will report it here.
My NFS server is a TrueNAS, based on FreeBSD 13.3.

Thanks in advance,
Richard
