Tracing silent crashes
I have a remote machine running Debian testing with kernel 2.4.21 that
operates in headless mode (no keyboard or monitor attached). At random
times, it seems to die, at least as far as any network connectivity is
concerned (the NICs are SMC 9342 using the epic100 driver). It simply
stops responding to any network request. I have a clue (difficult to
verify because of the remote location) that the machine doesn't actually
crash, and that the local console remains accessible; in other words, it
may just be a freeze of the networking stack.
There doesn't seem to be any correlation to time of day, and sometimes I'll
go weeks without this happening, while at other times it may be a daily
occurrence. The machine is on a UPS, so it's probably not power glitch
related. I've swapped the NICs for other units of the same model, though
not for a different model. It has also been happening for a while, across
kernel versions and despite fairly regular runs of apt-get dist-upgrade, so
I don't think it's due to any new software.

Upon reboot things return to normal, and there's no trace of anything in the
logs to indicate what the problem is.
I guess I have two questions -- does anyone recognize this problem, and is
there any way to capture more data that might give me a clue about what's
happening? The normal log files don't yield a clue.
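
To make the second question concrete, here is a minimal sketch of the kind of data capture I have in mind, assuming Python is available on the box. It pings a probe host and appends a timestamped copy of the per-interface counters and the interrupt table to a local file, syncing to disk each time so the last record survives a hard reset. The log path, the probe address (a documentation placeholder, to be replaced with the real gateway), the interval, and all function names are my own assumptions, not anything the epic100 driver or Debian provides.

```python
import os
import subprocess
import sys
import time

LOG_PATH = "/var/log/netwatch.log"  # assumed location; any local disk path works
PROBE_HOST = "192.0.2.1"            # placeholder address; substitute the real gateway
INTERVAL = 60                       # seconds between snapshots (assumed)


def parse_net_dev(text):
    """Extract packet and error counters per interface from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:      # the first two lines are column headers
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)     # some kernels print "eth0:1234" with no space
        fields = rest.split()
        if len(fields) < 11:
            continue
        counters[name.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_errs": int(fields[2]),
            "tx_packets": int(fields[9]),
            "tx_errs": int(fields[10]),
        }
    return counters


def host_reachable(host):
    """Send one ping with a 5 second deadline; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-w", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def read_file(path):
    """Return the file contents, or a marker string if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as exc:
        return "<unreadable: %s>" % exc


def log_snapshot(reachable):
    """Append one record and force it to disk before returning."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    counters = parse_net_dev(read_file("/proc/net/dev"))
    with open(LOG_PATH, "a") as log:
        log.write("%s reachable=%s counters=%r\n" % (stamp, reachable, counters))
        log.write(read_file("/proc/interrupts"))
        log.flush()
        os.fsync(log.fileno())


def main():
    while True:
        log_snapshot(host_reachable(PROBE_HOST))
        time.sleep(INTERVAL)


# Only loop when started explicitly with --run, so importing the file is harmless.
if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

Started in the background with the --run argument, this would leave a trail to read after the next reboot. If the records keep arriving with reachable=False while the interrupt count for the NIC stops increasing, that would support the theory that only the network side is freezing; if the records stop altogether, the whole machine went down.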