[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Apache goes haywire, NSCD fixes it. Odd (solved) problem.



Hi,

I have recently surmounted an odd problem involving Apache 1.3.27.0-2
on Debian testing/unstable (with a 2.4.22 kernel) and thought it may be
worthwhile to explain what happened to a group of people who might care,
in hopes that others find the information useful.

A few days ago my machine, which is a Dell P4 2.4 GHz box (pretty generic),
started experiencing extremely high load averages and CPU usage. A couple of
apache processes (anywhere from 1 to 5) had decided to eat 100% of CPU and
100% of RAM (1G physical 1G swap) between themselves (evenly distributed)
for no apparent reason. Restarting apache would temporarily fix these
runaway processes for anywhere from 30 seconds to 12 hours. Typically it
would start up again within 10 or 15 minutes.

Not being able to figure out what the hell these processes were doing by
examining logs and the server-status page, I started stracing the runaway
pids. The output looked like this:

<-- strace output -->
# strace -p <pid of a runaway apache>
close(7)                                = 0
mremap(0x44735000, 304750592, 304754688, MREMAP_MAYMOVE) = 0x44735000
open("/etc/passwd", O_RDONLY)           = 7
fcntl64(7, F_GETFD)                     = 0
fcntl64(7, F_SETFD, FD_CLOEXEC)         = 0
_llseek(7, 0, [0], SEEK_CUR)            = 0
fstat64(7, {st_mode=S_IFREG|0644, st_size=4158, ...}) = 0
mmap2(NULL, 4158, PROT_READ, MAP_SHARED, 7, 0) = 0x403f4000
_llseek(7, 4158, [4158], SEEK_SET)      = 0
fstat64(7, {st_mode=S_IFREG|0644, st_size=4158, ...}) = 0
munmap(0x403f4000, 4158)                = 0
close(7)                                = 0
<repeat>
<-- end strace -->

That would happen over and over about a hundred thousand times until the
system ran out of memory and the process was killed by the kernel.

I was stuck at this point. Why the hell is apache freaking out and eating
tons of CPU while attempting to open /etc/passwd? Who knows? You wouldn't
believe the amount of troubleshooting I did. New kernels, new source
compiled apache binaries with default configurations, confirmed md5 sums
of all apache binaries and linked libraries against a problem-free machine,
a complete hardware swap, and even a run with apache-ssl. I tried everything
and this was still happening.

About to give up hope, I started looking at strace output again. I didn't
notice this before:
<-- more strace -->
connect(7, {sa_family=AF_UNIX, path="/var/run/.nscd_socket"}, 110) = -1 \
ENOENT (No such file or directory)
close(7)                                = 0
open("/etc/passwd", O_RDONLY)           = 7
<-- end strace -->

Hmm.. It.s trying to connect to NSCD before trying (and failing) to open
/etc/passwd. It only was trying NSCD about once every 50 attempts.

Not knowing what NSCD even is I Googled around and found out it provides
cached passwd lookups (among other things). That sounded promising!
Installed NSCD and, well, look at the graphs yourself:

http://discore.org/cpusum-day.png
(Daily CPU usage % updated every 5 min)

NSCD was installed just after 3PM and the problem magically disappeared.

Bug? Maybe. Error on my part? More likely. Odd? I think so. Fun? Not at all.

I still don't know why apache couldn't just open /etc/passwd to get the
information that NSCD provides more readily. The permissions are fine
and nothing but apache is having problems.

I'd be available to do further testing of this if anyone is interested.
Mostly the point of this post is just to provide information on something
that happened to me and was fixed quite easily after far too much
troubleshooting.

Thanks for reading,
Tyler Morgan



Reply to: