
Diagnosing occasional random reboots



A server which has been running steadily for years has begun rebooting at random. To the best of my knowledge, nothing has changed. It is a dual-processor PIII. It runs stable.

It is tucked away in the loft and usually has no monitor attached, so tracking this down is difficult. However, even if I brought it into a more convenient area, short of sitting staring at the screen waiting for a crash or reboot, I'm not sure it would help much.

I've tried building a newer kernel from backports.org and trimmed it down as much as possible. There is nothing useful in syslog. A typical series of reboots looks like:

dougie   pts/0        tbird2xp:0.0     Tue Oct 31 17:15   still logged in
runlevel (to lvl 2)   2.6.17           Tue Oct 31 17:12 - 17:21  (00:08)
reboot   system boot  2.6.17           Tue Oct 31 17:12          (00:08)
dougie   pts/0        tbird2xp:0.0     Tue Oct 31 17:09 - crash  (00:02)
runlevel (to lvl 2)   2.6.17           Tue Oct 31 16:59 - 17:12  (00:12)
reboot   system boot  2.6.17           Tue Oct 31 16:59          (00:21)
dougie   pts/0        tbird2xp:0.0     Tue Oct 31 16:05 - crash  (00:54)
runlevel (to lvl 2)   2.6.17           Tue Oct 31 15:16 - 16:59  (01:43)
reboot   system boot  2.6.17           Tue Oct 31 15:16          (02:04)
date     new time                      Sun Oct 29 07:11
date     old time                      Sun Oct 29 07:12
root     pts/3        kitchens         Sun Oct 29 07:11 - crash (2+08:04)
dougie   pts/2        kitchens         Sat Oct 28 20:29 - crash (2+19:46)
dougie   pts/1        kitchens         Sat Oct 28 11:37 - 16:04 (1+05:27)
dougie   pts/0        tbird2xp:0.0     Fri Oct 27 13:16 - crash (4+03:00)


And the syslog shows nothing notable around the time of each reboot. Usually there are just lines from postfix as it processes the mail queue, then:

Oct 31 17:12:22 nick syslogd 1.4.1#17: restart (remote reception).
Oct 31 17:12:22 nick kernel: klogd 1.4.1#17, log source = /proc/kmsg started.
Oct 31 17:12:23 nick kernel: Inspecting /boot/System.map-2.6.17
Oct 31 17:12:23 nick kernel: Loaded 21314 symbols from /boot/System.map-2.6.17.
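To check every reboot rather than just the latest one, the log can be scanned for each syslogd restart line and the few lines preceding it. The following is only a rough sketch: the function name `context_before_restarts` and the `context` parameter are made up for illustration, and it assumes every startup is marked by a line containing "syslogd" and ": restart", as in the excerpt above. Restart lines can also be logged for other reasons (log rotation, for example), so each hit still needs checking by hand.

```python
from collections import deque


def context_before_restarts(lines, context=5):
    """Return (restart_line, preceding_lines) for each syslogd restart.

    Assumption: a restart is any line containing both "syslogd" and
    ": restart", matching the format shown in the excerpt above.
    """
    recent = deque(maxlen=context)  # rolling window of the last few lines
    hits = []
    for raw in lines:
        line = raw.rstrip("\n")
        if "syslogd" in line and ": restart" in line:
            hits.append((line, list(recent)))
            recent.clear()  # start a fresh window after each restart
        else:
            recent.append(line)
    return hits


def report(path, context=5):
    """Print the lines leading up to each restart found in the file at `path`."""
    with open(path, errors="replace") as handle:
        for restart, before in context_before_restarts(handle, context):
            print("\n".join(before))
            print(">>> " + restart)
            print()
```

Calling `report("/var/log/syslog")` would print the context for each restart; that path is an assumption and depends on how syslogd is configured. If the lines before each restart are always routine, that points away from software that logs its own failure and towards something abrupt such as a power or hardware fault.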

I'm not sure how to go about tracking this down. My searching of the archives suggests that these symptoms could point to a faulty physical component, such as memory or the PSU. So my next step is probably going to be swapping the PSU and running memtest.

One thing about the reboots is that they often appear to come in clusters. For example, around 7AM to 9AM on Oct 24 it looks like it was bouncing off and on for about two hours:

# last reboot
reboot   system boot  2.6.8            Wed Oct 25 05:03          (06:50)
reboot   system boot  2.6.8            Wed Oct 25 04:31          (07:22)
reboot   system boot  2.6.8            Tue Oct 24 11:09         (1+00:44)
reboot   system boot  2.6.8            Tue Oct 24 10:59          (00:06)
reboot   system boot  2.6.8            Tue Oct 24 09:52          (01:01)
reboot   system boot  2.6.8            Tue Oct 24 09:50          (01:03)
reboot   system boot  2.6.8            Tue Oct 24 09:49          (01:05)
reboot   system boot  2.6.8            Tue Oct 24 09:37          (01:17)
reboot   system boot  2.6.8            Tue Oct 24 09:05          (01:49)
reboot   system boot  2.6.8            Tue Oct 24 08:53          (02:00)
reboot   system boot  2.6.8            Tue Oct 24 08:51          (02:03)
reboot   system boot  2.6.8            Tue Oct 24 07:28          (03:26)
reboot   system boot  2.6.8            Tue Oct 24 07:26          (03:27)
reboot   system boot  2.6.8            Tue Oct 24 07:24          (03:29)
reboot   system boot  2.6.8            Tue Oct 24 07:01          (03:52)
reboot   system boot  2.6.8            Tue Oct 24 06:18          (04:36)
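To put numbers on the clustering, the output of `last reboot` can be parsed and the boot times grouped wherever consecutive boots fall within some gap of each other. This is a sketch only: the names `parse_boot_times` and `cluster`, and the two-hour default gap, are arbitrary choices of mine. `last` does not print the year, so it has to be supplied by hand.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# reboot   system boot  2.6.8            Tue Oct 24 09:52          (01:01)
BOOT_RE = re.compile(
    r"^reboot\s+system boot\s+\S+\s+\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2})"
)

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_boot_times(text, year):
    """Extract boot timestamps from `last reboot` output, oldest first.

    `year` must be supplied by the caller because `last` omits it.
    """
    times = []
    for line in text.splitlines():
        match = BOOT_RE.match(line.strip())
        if match:
            month, day, hour, minute = match.groups()
            times.append(
                datetime(year, MONTHS[month], int(day), int(hour), int(minute))
            )
    return sorted(times)


def cluster(times, max_gap=timedelta(hours=2)):
    """Group sorted timestamps; a new group starts when the gap exceeds max_gap."""
    groups = []
    for stamp in times:
        if groups and stamp - groups[-1][-1] <= max_gap:
            groups[-1].append(stamp)
        else:
            groups.append([stamp])
    return groups


def summarise(text, year, max_gap=timedelta(hours=2)):
    """Print one line per cluster: first boot, last boot, number of boots."""
    for group in cluster(parse_boot_times(text, year), max_gap):
        print(group[0], "to", group[-1], "boots:", len(group))
```

Feeding the saved output of `last reboot` to `summarise` gives one line per cluster. If the clusters tend to fall at similar times of day, that might hint at something environmental, such as temperature in the loft or mains quality, rather than a purely random fault.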

I'm a bit stumped on how to solve this and would appreciate any thoughts on strategy.

Dougie


