
Re: Tracing silent crashes



Thanks to all who replied. I was able to take a monitor to the machine and discovered that there was an error in the NTP configuration (I'm using a GPS-disciplined oscillator for the timecode, and was using the kernel PPS interface patches) that was causing some sort of meltdown. I've posted a message with the gory details to the NTP mailing list, so I'll spare you here.

But thanks in particular for the hints on network syslog and using a console terminal. I'm going to implement some combination of those to make future problems easier to solve.
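For the network syslog approach, the other machine only needs something listening on UDP port 514 that writes each message to disk the moment it arrives. Below is a minimal sketch in Python, assuming the headless box has been told to forward its messages (for example with a `*.* @loghost` line in /etc/syslog.conf under sysklogd, where `loghost` stands in for the receiving machine's name). The function names `decode_pri`, `open_listener` and `serve` and the log path are hypothetical; the PRI decoding follows the BSD syslog format of RFC 3164 (PRI = facility * 8 + severity). It is a diagnostic aid, not a replacement for a real syslog daemon.

```python
import re
import socket
import time

# BSD syslog (RFC 3164) datagrams start with "<PRI>", where PRI = facility * 8 + severity.
PRI_RE = re.compile(rb"^<(\d{1,3})>")
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def decode_pri(datagram):
    """Return (facility number, severity name, message text) for one raw datagram."""
    m = PRI_RE.match(datagram)
    if m is None:
        # No PRI header: RFC 3164 has relays treat it as user.notice (PRI 13).
        return 1, "notice", datagram.decode("utf-8", "replace")
    facility, severity = divmod(int(m.group(1)), 8)
    return facility, SEVERITIES[severity], datagram[m.end():].decode("utf-8", "replace")


def open_listener(host="0.0.0.0", port=514):
    """Bind a UDP socket; 514 is the standard syslog port and normally needs root."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def serve(sock, log_path, max_messages=None):
    """Append each received message to log_path, flushing after every line."""
    count = 0
    try:
        with open(log_path, "a", encoding="utf-8") as log:
            while max_messages is None or count < max_messages:
                data, (addr, _port) = sock.recvfrom(4096)
                facility, severity, text = decode_pri(data)
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                log.write(f"{stamp} {addr} fac={facility} {severity}: {text.rstrip()}\n")
                log.flush()  # get it on disk before the sender stops talking
                count += 1
    finally:
        sock.close()
    return count


if __name__ == "__main__":
    # Hypothetical log location on the receiving machine.
    serve(open_listener(), "/var/log/remote-headless.log")
```

Flushing after every line is the point: whatever the dying machine manages to send before it stops answering is already on the other box's disk. Kernel messages still have to travel through klogd and the network stack, so a hang in that stack may lose the last few lines.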

Thanks,

John

--On Sunday, January 18, 2004 14:45:38 +0100 Michael Bergbauer <michael@noname.franken.de> wrote:

On Sun Jan 18, 2004 at 08:33:02AM -0500, John Ackermann N8UR wrote:
I have a remote machine running Debian testing and kernel 2.4.21, that
operates in headless mode (no keyboard or monitor attached).  At random
times, it seems to die, at least as far as any network connectivity is
concerned (the NICs are SMC 9342 using the epic100 driver).  It simply
stops responding to any network request.  I have a clue (difficult to
verify because of the remote location) that the machine doesn't actually
crash, and that the local console remains accessible; in other words, it
may just be a freeze of the networking stack.

There doesn't seem to be any correlation with time of day, and sometimes
I'll go weeks without this happening, while at other times it may be a
daily occurrence.  The machine is on a UPS, so it's probably not related
to power glitches.  I've swapped NIC units, though not models.  And it's
been happening for a while, across regular apt-get dist-upgrade runs and
several kernel versions, so I don't think it's due to any recent
software change.

Upon reboot things return to normal, and there's no trace of anything in
the logs to indicate what the problem is.

I guess I have two questions: does anyone recognize this problem, and is
there any way to capture more data that might give me a clue about
what's happening?  The normal log files don't yield a clue.

Is there any chance of attaching a serial console to the machine? A
serial concentrator in the rack that you could plug into, at least while
fixing this bug? Another box of yours in the same rack? Then you could
set up this box to use a serial console and get all the console output
(including kernel oopses and panics), plus magic SysRq, over the serial
line.
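The usual serial console recipe is to boot the headless machine with something like `console=ttyS0,9600n8 console=tty0` on the kernel command line and run a null-modem cable to the neighbouring box, which then only has to record what arrives. The Python sketch below is one way to do that recording; it assumes a Unix host with the cable on a port such as /dev/ttyS0, and the function names and log path are hypothetical. `send_sysrq` follows the documented serial SysRq convention (a BREAK, then the command key within five seconds), which only works if the crashing kernel was built with CONFIG_MAGIC_SYSRQ and has SysRq enabled.

```python
import os
import termios
import time
import tty


def open_serial(path, baud=termios.B9600):
    """Open a serial port raw at 8N1, matching console=ttyS0,9600n8 on the other end."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    tty.setraw(fd)                      # no echo, no line editing, 8 data bits, no parity
    attrs = termios.tcgetattr(fd)
    attrs[2] |= termios.CLOCAL          # ignore modem control lines
    attrs[2] &= ~termios.CSTOPB         # one stop bit
    attrs[4] = attrs[5] = baud          # input and output speed
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return fd


def capture(fd, log_path, max_lines=None):
    """Copy console output to log_path, one timestamped line at a time."""
    buf = b""
    lines = 0
    with open(log_path, "a", encoding="utf-8") as log:
        def emit(raw):
            nonlocal lines
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            # rstrip() also drops the "\r" of the console's "\r\n" line endings.
            log.write(f"{stamp} {raw.decode('utf-8', 'replace').rstrip()}\n")
            log.flush()                 # an oops is only useful if it reaches the disk
            lines += 1

        while max_lines is None or lines < max_lines:
            chunk = os.read(fd, 1024)
            if not chunk:               # EOF: keep whatever partial line is left
                if buf:
                    emit(buf)
                break
            buf += chunk
            while b"\n" in buf and (max_lines is None or lines < max_lines):
                line, buf = buf.split(b"\n", 1)
                emit(line)
    return lines


def send_sysrq(fd, key):
    """Magic SysRq over a serial console: send a BREAK, then the command key."""
    termios.tcsendbreak(fd, 0)
    os.write(fd, key.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical device and log path on the neighbouring box.
    capture(open_serial("/dev/ttyS0"), "/var/log/headless-console.log")
```

Because the log lives on the neighbouring box, an oops or panic printed just before the network dies still ends up in a file.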


--
Michael Bergbauer <michael@noname.franken.de>
use your idle CPU cycles - See http://www.distributed.net for details.
Visit our mud Geas at geas.franken.de Port 3333


--
To UNSUBSCRIBE, email to debian-isp-request@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact
listmaster@lists.debian.org