
NFS hangs on unlink/safe_remove



Hello list,
(please CC me, I am not subscribed here)

I have a problem with one of our NFS exports that is hard to debug and to reproduce.

Setup (simplified):
- Two powerful VMware ESX hosts, cross-connected with 4 x 1 GBit/s
- One of these connections is reserved for NFS traffic only
- An NFS server, which stores its data on a DRBD volume (Wheezy and NFSv4 with RPCNFSDCOUNT=50)
- The problematic client (there are two dozen clients) runs Debian Squeeze. In this case the NFS server and the client are on the same dedicated host machine
- Client mounts the share with:
192.168.55.31:/var /srv/nfs/magento_var nfs _netdev,auto,soft,intr,rw,noatime,nodiratime
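
For reference, the mount entry above can be split into its fields to see exactly which options are set. This is only an illustrative sketch: parse_fstab_line is a hypothetical helper, not part of any existing tool, and it assumes the usual whitespace-separated fstab layout. It shows that the entry sets neither timeo nor retrans, so the client defaults apply to this soft mount.

```python
def parse_fstab_line(line):
    """Split one fstab-style line into device, mount point, type and options."""
    fields = line.split()
    device, mountpoint, fstype, raw_opts = fields[:4]
    options = {}
    for opt in raw_opts.split(","):
        # Options are either bare flags ("soft") or key=value pairs ("timeo=600").
        key, _, value = opt.partition("=")
        options[key] = value or True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options,
    }


# The mount entry exactly as given in this mail.
entry = parse_fstab_line(
    "192.168.55.31:/var /srv/nfs/magento_var nfs "
    "_netdev,auto,soft,intr,rw,noatime,nodiratime"
)

# Timeout-related options that are NOT set explicitly in the entry.
unset = [name for name in ("timeo", "retrans") if name not in entry["options"]]
print("options:", sorted(entry["options"]))
print("not set explicitly:", unset)
```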

So what happens?

Sometimes, when the application flushes its cache, the process hangs for hours. While it hangs, neither the NFS share nor the process itself can be accessed (for example, strace -p XXXX blocks and has to be killed with -9). Running the script under strace until the hang occurs, I see that it hangs on this operation:

stat("/srv/nfs/magento_var/cache/mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_", {st_mode=S_IFREG|0600, st_size=112, ...}) = 0
unlink("/srv/nfs/magento_var/cache/mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_"
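
A process that cannot be attached to with strace may be sitting in uninterruptible sleep (state "D"), which can be checked without strace by reading /proc/<pid>/stat. The following is a minimal sketch under that assumption. The helper names process_state and read_process_state are made up for illustration, and the sample line is invented, not taken from the affected machine.

```python
def process_state(stat_text):
    """Return the one-letter state field from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take everything after the LAST closing parenthesis.
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def read_process_state(pid):
    """Read the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return process_state(handle.read())


# Invented sample content for illustration only.
sample = "4242 (php5 (cache)) D 1 4242 4242 0 -1 4194560"
print(process_state(sample))  # prints "D"
```

If the state is "D", /proc/<pid>/stack (where the kernel provides it, readable as root) may show in which kernel function the process is blocked.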

On the client with sunrpc.nfs_debug=1023 I see this:

[1545947.053884] NFS: nfs_update_inode(0:11/5243170 ct=1 info=0x27e7f)
[1545947.053886] NFS: nfs_lookup_revalidate(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_) is valid
[1545947.053890] NFS: dentry_delete(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_, 8)
[1545947.054006] NFS: permission(0:11/3276803), mask=0x1, res=0
[1545947.054009] NFS: nfs_lookup_revalidate(var/cache) is valid
[1545947.054011] NFS: permission(0:11/3278035), mask=0x1, res=0
[1545947.054012] NFS: nfs_lookup_revalidate(cache/mage--a) is valid
[1545947.054014] NFS: permission(0:11/5243102), mask=0x1, res=0
[1545947.054017] NFS: permission(0:11/5243102), mask=0x1, res=0
[1545947.054019] NFS: nfs_lookup_revalidate(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_) is valid
[1545947.054021] NFS: permission(0:11/5243102), mask=0x3, res=0
[1545947.054023] NFS: unlink(0:11/5243102, mage---alphanumericZend_LocaleC_de_DE_currencynumber_)
[1545947.054025] NFS: safe_remove(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_)
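
To compare several hangs, it may help to extract the last NFS operation logged before the output stops. The sketch below parses debug lines of the form shown above. The regular expression and the helper parse_nfs_debug are assumptions based only on these sample lines, not on a documented log format.

```python
import re

# Matches lines such as "[1545947.054025] NFS: safe_remove(path)".
# Derived from the sample lines in this mail; other kernels may log differently.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+NFS:\s+(\w+)\((.*)\)(.*)$")


def parse_nfs_debug(lines):
    """Return (timestamp, operation, argument) tuples for matching lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2), match.group(3)))
    return events


# The last three lines of the client log quoted above.
log = [
    "[1545947.054021] NFS: permission(0:11/5243102), mask=0x3, res=0",
    "[1545947.054023] NFS: unlink(0:11/5243102, "
    "mage---alphanumericZend_LocaleC_de_DE_currencynumber_)",
    "[1545947.054025] NFS: safe_remove("
    "mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_)",
]

events = parse_nfs_debug(log)
timestamp, operation, argument = events[-1]
print("last logged operation:", operation, "at", timestamp)
```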

So it seems to happen sometimes while the script is removing files.

Curiously, other clients (also Squeeze) have no problems accessing the data on the same share at the same time. There is no noticeable load (network, CPU, HDD, HDD latency, DRBD, NFS etc.) on the machines. Both packet filters ACCEPT all incoming and outgoing traffic on this dedicated interface. The NIC shows no errors, drops or overruns.

Do you have any idea how to debug this better, or what the underlying problem could be?

--
/*
Mit freundlichem Gruß / With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatthaei@debian.org
        patrick@linux-dev.org
*/

