intermittent mountd problems
I'm going crazy here. There are two 2.0.38 slink machines serving NFS
at our site, and every few weeks they stop allowing new mounts. I've
searched exhaustively on the net without finding anything, and I was
hoping maybe someone here had encountered a similar problem.
Incidentally, we had the same problem with a 2.0.36 kernel, and it
always coincided with syslog messages flagging possible SYN attacks.
We thought it might be a kernel bug, so we upgraded and disabled
the SYN cookie option, which seemed to fix the problem at the time
but apparently hasn't.
We can go for weeks without a single blip, and then spend days with
NFS more down than up. Our users are understandably upset, and so are
we. I'm having trouble keeping the FreeBSD bigots at bay.
Problem description:
- the NFS daemons (nfsd and mountd) don't die or freeze, but don't
  service requests; thus in debug mode mountd gives

    Feb 15 15:36:05 square mountd[432]: mnt [1 100/2/15 15:36:05 yoda.cnd.mcgill.ca 0.0+0,10]
    Feb 15 15:36:05 square mountd[432]: ^I/exports/u0
    Feb 15 15:36:05 square mountd[432]: NFS mount of /exports/u0 attempted from 132.206.114.131

  and sometimes even prints a line claiming success, but on the client
  the mount always returns "RPC: Timed out" while the problem is ongoing
- usually it'll be just mountd that screws up, and already mounted
  filesystems are fine; it's unclear to me whether nfsd also gets
  hosed by itself, or if it happens as a result of trying to get mountd
  running again; I suspect the latter
- sometimes restarting nfs-server helps; often we actually have to
  reboot to get a quick fix, but even that doesn't last long on some
  days, and sometimes it doesn't even seem to help at all; sometimes
  multiple kill/restart sequences seem to work, though why that should
  be I don't know
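
To pin down when mountd stops answering, a small probe could be left
running from a client. The following is only a sketch under some
assumptions: it relies on `rpcinfo -u host program`, which sends a UDP
call to procedure 0 of the named RPC program, and the function names
`probe` and `watch` are made up for illustration. A successful null call
does not prove that real mount requests would succeed, so it may miss
the failure described above.

```python
import subprocess
import time


def probe(host, program="mountd", timeout=10, run=subprocess.run):
    """Return True if `rpcinfo -u host program` exits 0 within `timeout` seconds.

    `run` is injectable so the function can be exercised without a network.
    """
    try:
        result = run(
            ["rpcinfo", "-u", host, program],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung call, or rpcinfo missing on this machine: treat as "down".
        return False
    return result.returncode == 0


def watch(host, rounds, interval=60, probe_fn=probe, sleep=time.sleep):
    """Probe `rounds` times and return a list of (timestamp, is_up) state changes."""
    changes = []
    last = None
    for i in range(rounds):
        up = probe_fn(host)
        if up != last:
            # Record only transitions, so the log stays short over weeks.
            changes.append((time.strftime("%b %d %H:%M:%S"), up))
            last = up
        if i < rounds - 1:
            sleep(interval)
    return changes
```

Calling `watch("square", rounds=1440)` would probe once a minute for a
day; the timestamps of the transitions could then be lined up against
syslog on the server.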
I've tried leaving long straces running on mountd, I've tried sniffing
the network to see if we're being hurt by something else on the
network, and can't see anything noteworthy. The only oddity I've
noticed recently is that an rpcinfo -p of the server lists two unnamed
services I can't track down (any theories are welcome - I'm pretty
ignorant of rpc stuff generally):
    600100069    1   udp    773
    600100069    1   tcp    775
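
For anyone wanting to check their own servers for entries like these,
here is a rough sketch that picks the unnamed rows out of `rpcinfo -p`
output and notes which program-number range each falls in (ranges as
given in RFC 5531). The portmapper and mountd rows in `SAMPLE` are
invented for illustration; only the two 600100069 rows come from the
output above. The function names are hypothetical.

```python
SAMPLE = """\
   program vers proto   port
    100000    2   tcp    111  portmapper
    100005    1   udp    745  mountd
 600100069    1   udp    773
 600100069    1   tcp    775
"""


def classify(prog):
    """Map an RPC program number to its range per RFC 5531."""
    if prog < 0x20000000:
        return "officially assigned"
    if prog < 0x40000000:
        return "user-defined"
    if prog < 0x60000000:
        return "transient"
    return "reserved"


def unnamed_services(text):
    """Return (program, version, proto, port, range) for rows with no service name."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a data row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # A named row has a fifth column; unnamed rows stop at the port.
        if len(fields) == 4:
            prog, vers, proto, port = fields
            found.append((int(prog), int(vers), proto, int(port), classify(int(prog))))
    return found


for row in unnamed_services(SAMPLE):
    print(row)
```

Both unnamed rows land in the user-defined range, which suggests the
number was picked by whoever wrote the program rather than handed out
from the official registry.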
Also, this may be incidental, but the behaviour occurs much less
commonly on the server we have named gloom, which is on a different
network and subject to different traffic. Fixing it on gloom is
usually more pressing and more lasting, so I haven't had a chance to
investigate as thoroughly there.
---
P.S. We wanted to try running nfsd/mountd from inetd to see if that
helped matters in the short term, but inetd seemed willing to register
only the UDP or the TCP servers, not both. Is this a known
problem? The lines we used were taken from the manpages, i.e.
mount/1-2 dgram rpc/udp wait root /usr/sbin/rpc.mountd rpc.mountd
mount/1-2 stream rpc/tcp wait root /usr/sbin/rpc.mountd rpc.mountd
nfs/2 dgram rpc/udp wait root /usr/sbin/rpc.nfsd rpc.nfsd
nfs/2 stream rpc/tcp wait root /usr/sbin/rpc.nfsd rpc.nfsd
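
To check which registrations went missing, one could expand the inetd
lines into the (service, version, protocol) triples that should show up
in `rpcinfo -p`. This is a minimal sketch assuming the usual RPC entry
layout of `name/version[-version]` followed by socket type and
`rpc/proto`; the function name is made up. Note that rpcinfo may print a
different alias from /etc/rpc (for example `mountd` rather than
`mount`), so the comparison itself is left out here.

```python
INETD_LINES = """\
mount/1-2 dgram rpc/udp wait root /usr/sbin/rpc.mountd rpc.mountd
mount/1-2 stream rpc/tcp wait root /usr/sbin/rpc.mountd rpc.mountd
nfs/2 dgram rpc/udp wait root /usr/sbin/rpc.nfsd rpc.nfsd
nfs/2 stream rpc/tcp wait root /usr/sbin/rpc.nfsd rpc.nfsd
"""


def expected_registrations(text):
    """Return the set of (service, version, proto) triples the config asks for."""
    expected = set()
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out entry
        fields = line.split()
        # Only RPC entries are of interest: third field looks like rpc/udp.
        if len(fields) < 6 or not fields[2].startswith("rpc/"):
            continue
        name, _, versions = fields[0].partition("/")
        proto = fields[2][len("rpc/"):]
        low, _, high = versions.partition("-")
        # "1-2" expands to versions 1 and 2; a bare "2" is just version 2.
        for version in range(int(low), int(high or low) + 1):
            expected.add((name, version, proto))
    return expected


for triple in sorted(expected_registrations(INETD_LINES)):
    print(triple)
```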