intermittent mountd problems



I'm going crazy here.  There are two 2.0.38 slink machines serving NFS
at our site, and every few weeks they stop allowing new mounts.  I've
searched exhaustively on the net without finding anything, and I was
hoping maybe someone here had encountered a similar problem.

Incidentally, we had the same problem with a 2.0.36 kernel, and it
always coincided with syslog messages flagging possible SYN attacks.
We thought it might be a kernel bug, so we upgraded and disabled
the SYN cookie option, which seemed to fix the problem at the time,
but apparently hasn't.

We can go for weeks without a single blip, and then spend days with
NFS more down than up.  Our users are understandably upset, and so are
we.  I'm having trouble keeping the FreeBSD bigots at bay.

Problem description:

- the NFS daemons (nfsd and mountd) don't die or freeze, but don't 
  service requests; thus in debug mode mountd gives
  
Feb 15 15:36:05 square mountd[432]: mnt [1 100/2/15 15:36:05 yoda.cnd.mcgill.ca 0.0+0,10] 
Feb 15 15:36:05 square mountd[432]: ^I/exports/u0 
Feb 15 15:36:05 square mountd[432]: NFS mount of /exports/u0 attempted from 132.206.114.131

  and sometimes even prints a line claiming success, but on the client
  the mount always returns "RPC: Timed out" while the problem is ongoing
- usually it'll be just mountd that screws up, and already mounted
  filesystems are fine; it's unclear to me whether nfsd also gets
  hosed by itself, or if it happens as a result of trying to get mountd
  running again; I suspect the latter
- sometimes restarting nfs-server helps; often we actually have to
  reboot to get a quick fix, but even that doesn't last long on some
  days, and sometimes it doesn't even seem to help at all; sometimes
  multiple kill/restart sequences seem to work, though why that should
  be I don't know
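To catch the moment mountd stops answering (rather than waiting for a
user to report "RPC: Timed out"), something like the sketch below could
be run from a client at intervals.  It asks the portmapper for mountd's
UDP port and then sends mountd an RPC NULL call, treating a timeout as
"not servicing requests".  This is only a minimal sketch, not a tested
monitoring tool: the function names (build_call, parse_reply, rpc_call,
probe_mountd) are made up for illustration, it assumes UDP and mount
protocol version 1, and it sends a single datagram with no
retransmission.

```python
import socket
import struct

PMAP_PORT = 111
PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT = 100000, 2, 3
MOUNT_PROG = 100005          # standard program number for mountd
IPPROTO_UDP = 17


def build_call(xid, prog, vers, proc, args=b""):
    """Build an ONC RPC v2 call message with AUTH_NONE credentials."""
    # xid, msg_type CALL (0), RPC version 2, program, version, procedure
    header = struct.pack(">6I", xid, 0, 2, prog, vers, proc)
    auth_none = struct.pack(">2I", 0, 0)   # flavor 0, zero-length body
    return header + auth_none + auth_none + args


def parse_reply(data, xid):
    """Return the result bytes of an accepted, successful reply."""
    if len(data) < 24:
        raise ValueError("reply too short")
    rxid, msg_type, reply_stat = struct.unpack(">3I", data[:12])
    if rxid != xid or msg_type != 1 or reply_stat != 0:
        raise ValueError("not an accepted reply to this call")
    _flavor, verf_len = struct.unpack(">2I", data[12:20])
    off = 20 + ((verf_len + 3) // 4) * 4   # verifier body is padded to 4 bytes
    if len(data) < off + 4:
        raise ValueError("reply truncated")
    (accept_stat,) = struct.unpack(">I", data[off:off + 4])
    if accept_stat != 0:
        raise ValueError("call not successful: accept_stat=%d" % accept_stat)
    return data[off + 4:]


def rpc_call(host, port, prog, vers, proc, args=b"", timeout=5.0, xid=0x4D4E5444):
    """Send one UDP RPC call and return the result bytes (no retries)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_call(xid, prog, vers, proc, args), (host, port))
        data, _addr = sock.recvfrom(8192)
    return parse_reply(data, xid)


def probe_mountd(host, vers=1, timeout=5.0):
    """True if mountd answers a NULL call, False if it is unregistered or silent."""
    try:
        args = struct.pack(">4I", MOUNT_PROG, vers, IPPROTO_UDP, 0)
        result = rpc_call(host, PMAP_PORT, PMAP_PROG, PMAP_VERS,
                          PMAPPROC_GETPORT, args, timeout)
        (port,) = struct.unpack(">I", result[:4])
        if port == 0:                       # portmapper: program not registered
            return False
        rpc_call(host, port, MOUNT_PROG, vers, 0, b"", timeout)  # procedure 0 = NULL
        return True
    except socket.timeout:
        return False
```

Calling probe_mountd("square") from cron and logging the result with a
timestamp would at least give a record to line up against syslog.  Note
that a NULL call only shows that mountd is reading its socket; a real
mount could presumably still fail for other reasons.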

I've tried leaving long straces running on mountd and sniffing the
network to see whether something else on it is hurting us, but I
can't see anything noteworthy.  The only oddity I've noticed recently
is that an rpcinfo -p of the server lists two unnamed services I can't
track down (any theories are welcome - I'm pretty ignorant of RPC
generally):

  600100069    1   udp    773
  600100069    1   tcp    775
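As a rough way to pick such entries out of a longer listing, the sketch
below scans saved rpcinfo -p output for rows with no name column and
notes which range of the RPC program number space each falls in (the
ranges are the ones given in the ONC RPC specification).  The function
names are made up for illustration, and it assumes the usual
five-column layout in which the name is simply missing when rpcinfo
cannot find the number in the local RPC database (typically /etc/rpc).

```python
def classify(prog):
    """Name the ONC RPC program-number range that prog falls in."""
    if prog < 0x20000000:
        return "assigned"        # centrally assigned numbers
    if prog < 0x40000000:
        return "user-defined"    # chosen locally by whoever wrote the service
    if prog < 0x60000000:
        return "transient"       # picked dynamically at run time
    return "reserved"


def unnamed_services(rpcinfo_text):
    """Return (prog, vers, proto, port, range) for rows lacking a name."""
    found = []
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a data row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        if len(fields) == 4:     # program, version, protocol, port; no name
            prog, vers, proto, port = fields
            found.append((int(prog), int(vers), proto, int(port),
                          classify(int(prog))))
    return found
```

Fed the two lines above, this reports both as "user-defined", which
suggests the number was picked by the author of some locally installed
package rather than centrally assigned.  It does not say which package.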

Also, this may be incidental, but the behaviour occurs much less
commonly on the server we have named gloom, which is on a different
network and subject to different traffic.  Fixing it on gloom is
usually more pressing and more lasting, so I haven't had a chance to
investigate as thoroughly there.

---

p.s. We wanted to try running nfsd/mountd from inetd to see if that
helped matters in the short term, but inetd seemed to only be willing
to register the udp or the tcp servers, not both.  Is this a known
problem?  Lines used were from the manpages, i.e. 

mount/1-2 dgram  rpc/udp wait  root  /usr/sbin/rpc.mountd rpc.mountd
mount/1-2 stream rpc/tcp wait  root  /usr/sbin/rpc.mountd rpc.mountd

nfs/2 dgram  rpc/udp wait root /usr/sbin/rpc.nfsd rpc.nfsd
nfs/2 stream rpc/tcp wait root /usr/sbin/rpc.nfsd rpc.nfsd


