NFS locking issues
it was just a couple of weeks ago that I was trying to help others
with NFS, and here I am asking something! doh!
anyways, I noticed recently that my NFS server at home seems to
have trouble with locking. I have 2 clients which use it to host
home directories (1 Debian woody, 1 SuSE 8). I first noticed it about
a week ago when trying to load gnp (gnome notepad, my favorite X editor):
it didn't load, it just hung, and I was getting this in my local (client)
kernel log:
Aug 25 13:56:37 aphro kernel: lockd: task 173568 can't get a request slot
Aug 25 13:57:59 aphro kernel: lockd: task 173597 can't get a request slot
Aug 25 13:58:49 aphro kernel: lockd: task 173597 can't get a request slot
Aug 25 13:59:39 aphro kernel: lockd: task 173597 can't get a request slot
Aug 25 14:00:29 aphro kernel: lockd: task 173597 can't get a request slot
Aug 25 14:01:19 aphro kernel: lockd: task 173597 can't get a request slot
and at the same time I was getting this in my server kernel log:
lockd: cannot monitor 10.10.10.10
statd: server localhost not responding, timed out
nsm_mon_unmon: rpc failed, status=-5
lockd: cannot monitor 10.10.10.10
statd: server localhost not responding, timed out
nsm_mon_unmon: rpc failed, status=-5
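to get a feel for how often these errors pile up, something like this can tally them out of the kernel log (the default log path is my assumption — Debian normally puts kernel messages in /var/log/kern.log — and the function name is just made up):

```shell
#!/bin/sh
# count_lock_errors: tally lockd/statd failures in a kernel log.
# The default log path is an assumption (Debian typically logs kernel
# messages to /var/log/kern.log); pass another path as the argument.
count_lock_errors() {
    log="${1:-/var/log/kern.log}"
    echo "request-slot errors: $(grep -c "can't get a request slot" "$log")"
    echo "statd timeouts:      $(grep -c 'statd: server localhost not responding' "$log")"
}
```

running it on the server vs. each client would show which side is logging the most failures.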
one website said this is the result of an overloaded server, but I
don't think it's overloaded with only 2 clients (usually only 1 of which
is using it at a time, since these systems are on the same KVM). I
can usually work around it short term by restarting the NFS services.
not many apps seem to be affected by it: gnome-terminal works fine, afterstep
is fine, mozilla and opera are fine, staroffice 6 is fine. I can only
assume that they either don't use locking or do it in another
manner.
I have the NFS server (Debian 3.0 / 2.2.19 / using kernel NFS) set
to start 19 nfsd server threads, and it also runs the lockd service
(kernel level):
(querying the server from the client):
[root@aphro:~]# rpcinfo -p gateway
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  19662  status
    100024    1   tcp   7617  status
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  19663  nlockmgr
    100021    3   udp  19663  nlockmgr
    100021    4   udp  19663  nlockmgr
    100005    1   udp  19664  mountd
    100005    1   tcp   7618  mountd
    100005    2   udp  19664  mountd
    100005    2   tcp   7618  mountd
    100005    3   udp  19664  mountd
    100005    3   tcp   7618  mountd
(querying the client from the client):
[root@aphro:~]# rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100021    1   udp   1024  nlockmgr
    100021    3   udp   1024  nlockmgr
    100021    4   udp   1024  nlockmgr
    100024    1   udp   1025  status
    100024    1   tcp   1025  status
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100005    1   udp   1026  mountd
    100005    1   tcp   1026  mountd
    100005    2   udp   1026  mountd
    100005    2   tcp   1026  mountd
    100005    3   udp   1026  mountd
    100005    3   tcp   1026  mountd
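both listings look complete to me: for locking to work, nlockmgr (100021) and status (100024) have to be registered on both ends, and they are. a little filter like this can sanity-check any rpcinfo -p dump (the function name is just something I made up):

```shell
#!/bin/sh
# check_lock_rpc: read `rpcinfo -p` output on stdin and report whether
# the two services NFS locking depends on are registered.
check_lock_rpc() {
    dump=$(cat)
    for svc in nlockmgr status; do
        if echo "$dump" | grep -qw "$svc"; then
            echo "$svc: registered"
        else
            echo "$svc: MISSING"
        fi
    done
}
# e.g.: rpcinfo -p gateway | check_lock_rpc
```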
running nfsstat on the server shows the following results:
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
11900099   1420       0          1420       0

Server nfs v3:
null         getattr      setattr      lookup       access       readlink
15        0% 7292735  61% 171766    1% 625793    5% 1426891  11% 389       0%
read         write        create       mkdir        symlink      mknod
830197    6% 1053611   8% 150175    1% 2889      0% 979       0% 3         0%
remove       rmdir        rename       link         readdir      readdirplus
132602    1% 3179      0% 1195      0% 333       0% 18594     0% 2901      0%
fsstat       fsinfo       pathconf     commit
395       0% 305       0% 0         0% 185152    1%
(I have the clients mounting the filesystem with the option nfsvers=3;
all other stats reported by nfsstat are 0.) my next thing to try is
switching to nfsvers=2 to see if it helps at all.
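for what it's worth, the badcalls figure above is tiny relative to total calls; a quick one-liner (the helper name is made up, the numbers are from the nfsstat output above) puts it in perspective:

```shell
#!/bin/sh
# badcall_ratio: print bad RPC calls as a percentage of total calls.
badcall_ratio() {
    awk -v calls="$1" -v bad="$2" 'BEGIN { printf "%.3f%%\n", 100 * bad / calls }'
}
# from the nfsstat output above: 1420 bad out of 11900099 total calls
badcall_ratio 11900099 1420   # → 0.012%
```

so roughly one call in ten thousand goes bad — that doesn't look like an overloaded server to me either.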
all 3 machines are on the same VLAN of my Summit 48-port switch; with
a 17-gig backplane I am certain there are no bandwidth issues. one website
recommended doing a ping -f between the server and client to check for
packet loss, so I did it anyways just to see the results:
server to client:
--- aphro.aphroland.org ping statistics ---
60496 packets transmitted, 60494 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/3.4 ms
client to server:
--- gateway.aphroland.org ping statistics ---
78989 packets transmitted, 78983 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.2/44.0 ms
server is:
P3-800
1GB RAM
dual Western Digital 100GB Special Edition (8MB cache each) drives in RAID1
2.2.19 kernel
client1 is:
Athlon 1300
768MB RAM
9.1GB UltraWide SCSI disk
2.2.19 kernel
client2 is:
P3-500
512MB RAM
12GB IBM IDE disk
2.4.18 kernel
one thing that is curious: I ran lsof to see the open ports used
by rpc.statd, and it is using 2 at the moment, one of which is 7617/udp. I
ran a UDP nmap scan against localhost and nmap reported that port was
closed, but the same scan against that port from my client reported
it open. my firewalling rules only affect the eth0 interface,
so I am not sure why statd stops responding to localhost connections,
which seems to be the heart of the problem?
my rpc firewall rules:
PORTS="`rpcinfo -p | awk '{print $4}' | grep '[0-9]'`"
for rpcport in $PORTS
do
    /sbin/ipchains -A input -s 0/0 -d 0/0 $rpcport -j REJECT -p tcp -i eth0
    /sbin/ipchains -A input -s 0/0 -d 0/0 $rpcport -j REJECT -p udp -i eth0
done
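one thing I notice about that PORTS pipeline: rpcinfo -p prints the same port once per program version, so the loop installs duplicate rules, and since the RPC ports move whenever a service restarts, rules generated from an old snapshot may no longer match what statd is actually bound to. de-duplicating at least is easy; a sketch (function name is my own invention):

```shell
#!/bin/sh
# rpc_ports: extract the unique port numbers from `rpcinfo -p` output
# read on stdin (the header line has no digits in field 4, so the grep
# drops it, same trick as the original pipeline).
rpc_ports() {
    awk '{print $4}' | grep '[0-9]' | sort -nu
}
# e.g.: PORTS="$(rpcinfo -p | rpc_ports)"
```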
the 2nd port that rpc.statd is listening on (807/udp) is reported to
be open by a UDP nmap scan against localhost on the server:
[root@portal:/etc/init.d]# nmap -sU -vv -p 807,7617 localhost
Starting nmap V. 2.54BETA31 ( www.insecure.org/nmap/ )
Host debian (127.0.0.1) appears to be up ... good.
Initiating UDP Scan against debian (127.0.0.1)
The UDP Scan took 2 seconds to scan 2 ports.
Adding open port 807/udp
Interesting ports on debian (127.0.0.1):
(The 1 port scanned but not shown below is in state: closed)
Port       State    Service
807/udp    open     unknown
Nmap run completed -- 1 IP address (1 host up) scanned in 2 seconds
thanks for any ideas!
nate