
NFS stability and file ownership problem.



Hello,

I've been trying to find the answer to my server's NFS problems for about a
month now but the searching I have done on email list archives and HOWTOs so
far hasn't helped. I don't think the two problems are related except that
they both deal with NFS.

SITUATION:

I run a variety of Debian 2.1 and Debian 2.2 servers on my network:

host                 distro       kernel       rpc.nfsd    automount
file-server-one      debian 2.1   2.2.14       2.2beta37   3.1.1
file-server-two      debian 2.1   2.2.14       2.2beta37   3.1.1
file-server-three    debian 2.1   2.2.14       2.2beta37   3.1.1

cpu-server-one       debian 2.2   2.2.15pre20  2.2beta47   3.1.4
cpu-server-two       debian 2.2   2.2.15pre20  2.2beta47   3.1.4
cpu-server-three     debian 2.2   2.2.15pre20  2.2beta47   3.1.4
cpu-server-four      debian 2.2   2.2.15pre20  2.2beta47   3.1.4

As you can see, the newer servers are used for running CPU-intensive jobs.
They use their local hard drives to store the intermediate data created
during such batch jobs, and that data is also used by the other cpu-servers.
The original data set that the batch jobs use as input is stored on
file-server-three. File-servers one and two hold the users' individual home
directories.

All machines use NFS to export their data directories. Mounts that point
to file-server-one and file-server-two are automounted (via the autofs
package) under /h, whilst the other machines' exported filesystems are
mounted under /d (eg /d/cpu-server-one-data1). All of this mount information
is distributed via NIS, as is the user/group info.

PROBLEM ONE - NFS STABILITY PROBLEM:

Occasionally my users who are running these batch jobs on the cpu-servers
will find that their job grinds to a halt because file-server-three stops
responding to all of the machines. For example, when file-server-three has
stopped responding, running ls -l /d/file-server-three-data1 from the shell
hangs indefinitely.

To get it going again, I would run "/etc/init.d/nfs-server restart" as root
on file-server-three, and then their batch jobs would continue on their
merry way as that mount point would then start responding again.
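
A minimal sketch of how the hang described above could be detected from a
script rather than by hand, assuming Python is available on the clients. The
function name mount_responds and the 10 second default timeout are
illustrative choices, not something from the setup described here. It runs
the same ls -l probe in a child process and reports whether it finished in
time.

```python
import subprocess
import time


def mount_responds(path, timeout=10.0, command=("ls", "-l")):
    """Return True if `command path` finishes within `timeout` seconds.

    The exit status is ignored: any answer at all, even an error,
    means the mount point is not hung.
    """
    proc = subprocess.Popen(
        list(command) + [path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(0.1)
    # Still running after the deadline: treat the mount as hung.
    # We deliberately do not wait for the child after kill(), because a
    # process blocked on a hard NFS mount may not die until the server
    # comes back.
    proc.kill()
    return False
```

Calling mount_responds("/d/file-server-three-data1") from cron on one of the
cpu-servers would at least record when the hangs start, which may help
correlate them with load on file-server-three.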

I've included info about file-servers one and two as a comparison: these
servers make available the users' home directories (eg
file-server-one:/home/scottb is mounted under /h/scottb on all of the
linux servers on the network). There doesn't seem to be any stability
problem with these servers at all, and they make their data directories
available to many more clients, running linux either as a server or
workstation, or via samba to win32 clients.

Additionally, the cpu-servers' exported file systems don't suffer the same
stability problems either. They share the intermediate results of the
batch jobs, eg /d/cpu-server-one-data1 (cpu-server-one:/data1) is available
on the other cpu-servers.

On the problematic file-server-three machine, I've tried to upgrade the
nfs-server deb package but found I would also need to upgrade libraries on
the server, so I have been reluctant to do so. For now I have lowered the
rsize and wsize options that mount uses when mounting the drive on other
machines (from 8192 to 4096). Although the drive now stops responding a
little less often, the problem has not gone away.
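
For reference, a small sketch of what an autofs map entry carrying the
lowered rsize and wsize values might look like, generated in Python so the
values can be varied while testing. The helper name autofs_entry is
hypothetical, and the map key and the export path file-server-three:/data1
are assumptions based on the /d/file-server-three-data1 mount point; the
real map is not shown here.

```python
def autofs_entry(key, server, export, rsize=4096, wsize=4096):
    """Format one indirect autofs map line: key -options server:path."""
    options = f"rw,rsize={rsize},wsize={wsize}"
    return f"{key} -{options} {server}:{export}"


# Assumed key and export path, for illustration only.
print(autofs_entry("file-server-three-data1", "file-server-three", "/data1"))
```

Generating the line for the old values (rsize=8192, wsize=8192) as well makes
it easy to switch a single client back and compare how often it hangs.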

PROBLEM TWO - FILE OWNERSHIP FOR NFS MOUNTED FILE SHARES:

This problem only affects the four cpu-servers. On each of these machines, I
have an /etc/exports file that looks like:

/data1	192.168.0.1/255.255.255.0(rw,root_squash,map_nis=syrinx)

and it is identical on each machine, since the servers are essentially
identical except for the serial numbers on the servers themselves :-)

However, when I mount one of the cpu-server-xxx:/data1 filesystems on
another machine (eg file-server-one or even another cpu-server-yyy
machine), absolutely all of the files appear to belong to "nobody, nogroup",
even though on the local machine the same files belong to proper users of
the network (eg scottb, users).

Additionally, the /etc/exports file on the file-server-zzz machines uses
exactly the same options (rw,root_squash,map_nis=syrinx), and when their
exports are mounted on other linux servers/workstations, the files
contained on them show the right ownership.

So, although I can share the files stored on the cpu-server machines, access
can only ever be read-only at the moment, because the system accessing a
file over NFS thinks it is owned by nobody, which is not correct.
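
One way the ownership mismatch could be quantified is to compare numeric
owners on both sides of the mount. The following is only a sketch, assuming
Python is available on both machines; the function name owner_counts is
illustrative. It tallies the (uid, gid) pairs under a directory, using
numeric ids so that NIS name lookups do not enter the picture.

```python
import os
from collections import Counter


def owner_counts(top):
    """Count (uid, gid) pairs for every entry under `top`.

    Symlinks are not followed, so a link is counted by its own owner.
    """
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            st = os.lstat(os.path.join(dirpath, name))
            counts[(st.st_uid, st.st_gid)] += 1
    return counts
```

Running owner_counts("/data1") on cpu-server-one and
owner_counts("/d/cpu-server-one-data1") on a client should give the same
tallies if ids pass through unchanged. If the client side collapses to a
single pair, the mapping is being lost on the way rather than in the name
lookup on the client.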

I'm hoping that my problem is not isolated and that someone out there has
had problems similar to mine and has successfully dealt with it. Any
solutions, or even ideas would be helpful. Apart from these two small
issues, I have a great network and file sharing system that needs little
maintenance and keeps going for long periods without breaking.

Regards,
Scott Bragg
Senior System Administrator
Syrinx Speech Systems







