
Open File Limit



Sorry for the cross-post, but I wasn't sure what audience should get this.

I run a web/email/name server that pushes about 0.5-1.0 Mbps.  Memory
usage is always below 50% and CPU usage is minimal (below 10%).  A custom
script feeds this data into rrdtool, so I can see CPU and memory usage up
to within 5 minutes of the crash.

However, every morning I run a Perl script that processes all of my Apache
logs with webalizer.  I believe it is this script that causes several
errors in /var/log/daemon.log (cron.daily runs at 06:25:00):
May  1 06:32:40 a-web inetd[17351]: getpwnam: mail: No such user
May  1 06:32:42 a-web inetd[17354]: getpwnam: cyrus: No such user
<-- several logs removed, just named information -->
May  1 08:45:55 a-web inetd[13332]: execv /usr/sbin/exim: Too many open files in system
May  1 08:45:56 a-web inetd[13336]: execv /usr/sbin/exim: Too many open files in system
May  1 08:46:24 a-web pop3d[13433]: connect from x.x.x.x
May  1 08:46:24 a-web pop3d[13433]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May  1 08:48:23 a-web proftpd[13841]: connect from x.x.x.x
May  1 08:48:23 a-web proftpd[13841]: error: cannot execute /usr/sbin/proftpd: Too many open files in system
May  1 08:49:26 a-web pop3d[14036]: connect from x.x.x.x
May  1 08:49:26 a-web pop3d[14036]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May  1 08:49:35 a-web pop3d[14068]: connect from x.x.x.x
May  1 08:49:35 a-web pop3d[14068]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May  1 08:50:05 a-web pop3d[14164]: connect from x.x.x.x
May  1 08:50:26 a-web pop3d[14225]: connect from x.x.x.x
May  1 08:50:26 a-web pop3d[14225]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May  1 08:51:05 a-web pop3d[14346]: connect from x.x.x.x
May  1 08:51:05 a-web pop3d[14346]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May  1 15:51:14 a-web inetd[14372]: getpwnam: mail: No such user
May  1 15:51:27 a-web inetd[14393]: getpwnam: cyrus: No such user
May  1 15:51:31 a-web inetd[14400]: getpwnam: cyrus: No such user
May  1 15:51:44 a-web inetd[14419]: getpwnam: cyrus: No such user
May  1 15:51:50 a-web inetd[14430]: getpwnam: cyrus: No such user

During this time, websites stay available, but FTP, SSH, and email are
down.  I don't know why there is a 7-hour jump in the logs (I did not remove
any entries between 08:51 and 15:51).  When I got the machine rebooted, it
wasn't even 9 am (and this message will be sent by 10:30 am).

Now, my named (later this morning) says:
May  1 09:10:51 a-web named[234]: limit files set to fdlimit (1024)

However, I have changed my:
	/usr/src/(kernel version)/include/linux/limits.h
	/usr/include/linux/limits.h
to have:
	#define NR_OPEN         2048
I then packaged up the kernel using make-kpkg kernel_image and installed it.

I have also changed my /etc/security/limits.conf to have:
*               soft    nofile  2048
*               hard    nofile  2048

My ulimit -a reports:
core file size        (blocks, -c) 0
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
max locked memory     (kbytes, -l) unlimited
max memory size       (kbytes, -m) unlimited
open files                    (-n) 2048
pipe size          (512 bytes, -p) 8
stack size            (kbytes, -s) 8192
cpu time             (seconds, -t) unlimited
max user processes            (-u) 7168
virtual memory        (kbytes, -v) unlimited

So according to ulimit I have effectively doubled my open file limit (now
2048 when the default is 1024)...  But you can see that at least named still
thinks the limit is 1024.
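
One possible explanation for the mismatch: a limit is inherited from
whatever started the process, and /etc/security/limits.conf is applied
through pam_limits at login, so a daemon started from an init script at boot
may never see the new value.  A minimal sketch of checking the limit a
process actually gets, using Python's standard resource module (the function
names nofile_limits and describe are made up for illustration):

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values of the calling process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def describe(value):
    """Render a limit value the way ulimit does for 'no limit'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft nofile:", describe(soft))
    print("hard nofile:", describe(hard))
```

Running this from a login shell and again from a cron job or an init script
would show whether the two environments really get the same limit.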

Anyway, even after changing this limit, it continues to crash every
morning.  If I run the webalizer script from a shell, it does not crash.  In
fact, if I run the cron.daily scripts one at a time, the server doesn't
crash.

Info that might be helpful:
Debian version: stable woody 3.0
kernel version: 2.4.18
libc6 version: 2.2.5-11.5
apache version: 1.3.26-0woody3
bind version: 8.3.3-2.0woody
webalizer version: 2.01.10-2
cronolog version: 1.6.1-0.1
ram: 1 gigabyte
swap: 1 gigabyte
cpu: Intel Pentium 3 1.3 GHz

So my questions are:
1) How do I fix this situation :).
2) Is there a way to see the current number of real open files?  lsof
reports all open sockets, etc., so I'm not sure how.  I'm thinking that if I
could capture each process and its open files every couple of seconds, I
might be able to determine the problem.
I'm guessing the proper way is:
lsof | grep REG | wc -l
   1809

Which is a little odd, because the limit used to be 1024 and we didn't have
problems during the day.  So this count is probably incorrect.
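
A detail that may matter here: "Too many open files in system" is the ENFILE
error, meaning the kernel-wide file table is full, whereas hitting the
per-process limit produces plain "Too many open files" (EMFILE).  The
kernel-wide counters live in /proc/sys/fs/file-nr, with the ceiling in
/proc/sys/fs/file-max.  Below is a minimal sketch of sampling them every few
seconds, assuming a Linux /proc filesystem.  The first field is the number
of allocated file handles and the last is the maximum; the meaning of the
middle field differs between kernel versions, so it is printed as-is.  The
names parse_file_nr and read_file_nr are hypothetical.

```python
import sys
import time

FILE_NR = "/proc/sys/fs/file-nr"


def parse_file_nr(text):
    """Split the three whitespace-separated counters of file-nr into ints."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    allocated, middle, maximum = (int(f) for f in fields)
    return allocated, middle, maximum


def read_file_nr(path=FILE_NR):
    """Read and parse the kernel-wide file handle counters."""
    with open(path) as fh:
        return parse_file_nr(fh.read())


if __name__ == "__main__":
    # Optional first argument: seconds between samples (default 5).
    interval = float(sys.argv[1]) if len(sys.argv) > 1 else 5.0
    while True:
        allocated, middle, maximum = read_file_nr()
        stamp = time.strftime("%b %d %H:%M:%S")
        print("%s allocated=%d middle=%d max=%d"
              % (stamp, allocated, middle, maximum), flush=True)
        time.sleep(interval)
```

Left running with its output redirected to a file across the 06:25 cron
window, this would show whether the allocated count climbs toward the
maximum.  If it does, the limit to look at is probably
/proc/sys/fs/file-max rather than NR_OPEN or ulimit.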

Thanks,

Matthew Walkup


