
Re: System crashes for no apparent reason



On Wed, Jun 06, 2012 at 06:14:33AM +0100, Marc Shapiro wrote:
> I am running a reasonably up-to-date Squeeze box (it has been a few days 
> since I did an aptitude update and safe-upgrade).  This problem has 
> actually been occurring sporadically for a few weeks now.  The system 
> will simply die.  I leave the computer on 24/7, normally without 
> problems for weeks, or months, but lately that has not been working.
> 
> A few weeks back I had hard disk failure.  After I replaced that disk 
> and reinstalled Squeeze (and /home and /usr/local from the old disk) 
> several problems started occurring.  The dying system is the most 
> annoying.  I will wake up, or come home from work and find the monitor 
> showing its "No Signal" screen and the box is totally unresponsive.  I 
> cannot ssh in, or anything.  The system is completely down.  This 
> happened every couple of days for a while and then it seemed to go 
> away.  For a week, or more, I had no trouble.  Then, yesterday evening 
> it died.  After verifying that I could not ssh in, I rebooted.  This 
> evening, I came home from work and spent a few minutes at the computer, 
> then had to go out at about 6:20PM.  When I came back the system was 
> dead, again.  I rebooted.  This time I went through syslog and copied 
> out the parts just before the system died each of the last two times.  I 
> generally try to avoid having to look at logs, but...  I have inserted 
> those sections of log below.  The only similarity that I can find is 
> that, in each case the last thing that shows in the log was running 
> cron.hourly.  The only problem is that there is nothing in 
> /etc/cron.hourly/ (it is an empty directory).  There are some entries 
> concerning the nouveau graphics driver.  I know nothing about drivers, 
> but I'm sure someone out there does.  Is it pointing to a problem?
> 
> Anyway, here are the log snippets:
> 
> Jun  4 07:47:39 xander rsyslogd: [origin software="rsyslogd" 
> swVersion="4.6.4" x-pid="1246" x-info="http://www.rsyslog.com"] rsyslogd 
> was HUPed, type 'lightweight'.
> Jun  4 07:47:44 xander anacron[23172]: Job `cron.daily' terminated
> Jun  4 07:47:44 xander anacron[23172]: Normal exit (1 job run)
> Jun  4 07:52:04 xander kernel: [755950.550970] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 3 on vga encoder (output 0)
> Jun  4 08:17:01 xander /USR/SBIN/CRON[23483]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  4 08:31:09 xander kernel: [758296.121147] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
> Jun  4 09:09:23 xander kernel: [760590.284707] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 3 on vga encoder (output 0)
> Jun  4 09:17:01 xander /USR/SBIN/CRON[23688]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  4 10:17:01 xander /USR/SBIN/CRON[23854]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  4 10:29:26 xander dhclient: DHCPREQUEST on eth0 to 192.168.1.1 port 67
> Jun  4 10:29:26 xander dhclient: DHCPACK from 192.168.1.1
> Jun  4 10:29:26 xander dhclient: bound to 192.168.1.2 -- renewal in 
> 42208 seconds.
> Jun  4 11:17:01 xander /USR/SBIN/CRON[24009]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  4 12:17:01 xander /USR/SBIN/CRON[24169]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  4 13:17:01 xander /USR/SBIN/CRON[24300]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  4 18:30:00 xander kernel: imklog 4.6.4, log source = /proc/kmsg 
> started.
> Jun  4 18:30:00 xander rsyslogd: [origin software="rsyslogd" 
> swVersion="4.6.4" x-pid="1276" x-info="http://www.rsyslog.com"] (re)start

The above looks pretty uneventful...

> and
> 
> Jun  5 08:04:41 xander rsyslogd: [origin software="rsyslogd" 
> swVersion="4.6.4" x-pid="1276" x-info="http://www.rsyslog.com"] rsyslogd 
> was HUPed, type 'lightweight'.
> Jun  5 08:04:41 xander rsyslogd: [origin software="rsyslogd" 
> swVersion="4.6.4" x-pid="1276" x-info="http://www.rsyslog.com"] rsyslogd 
> was HUPed, type 'lightweight'.
> Jun  5 08:05:08 xander anacron[6109]: Job `cron.daily' terminated
> Jun  5 08:05:08 xander anacron[6109]: Normal exit (1 job run)
> Jun  5 08:17:01 xander /USR/SBIN/CRON[6504]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 08:31:48 xander kernel: [50534.578689] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 3 on vga encoder (output 0)
> Jun  5 09:17:01 xander /USR/SBIN/CRON[6716]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 10:17:01 xander /USR/SBIN/CRON[6929]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 11:17:01 xander /USR/SBIN/CRON[7114]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 12:17:01 xander /USR/SBIN/CRON[7287]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 13:17:01 xander /USR/SBIN/CRON[7458]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 14:17:01 xander /USR/SBIN/CRON[7508]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 15:17:01 xander /USR/SBIN/CRON[7557]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 16:12:35 xander dhclient: DHCPREQUEST on eth0 to 192.168.1.1 port 67
> Jun  5 16:12:35 xander dhclient: DHCPACK from 192.168.1.1
> Jun  5 16:12:35 xander dhclient: bound to 192.168.1.2 -- renewal in 
> 34670 seconds.
> Jun  5 16:17:01 xander /USR/SBIN/CRON[7608]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 16:31:21 xander kernel: [79307.749897] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
> Jun  5 17:17:01 xander /USR/SBIN/CRON[7663]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 18:11:30 xander kernel: [85317.130389] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 3 on vga encoder (output 0)
> Jun  5 18:11:30 xander kernel: [85317.150673] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
> Jun  5 18:11:30 xander kernel: [85317.150685] [drm] nouveau 
> 0000:01:00.0: Output VGA-1 is running on CRTC 0 using output A
> Jun  5 18:11:30 xander kernel: [85317.324416] [drm] nouveau 
> 0000:01:00.0: Load detected on output A
> Jun  5 18:11:31 xander kernel: [85317.551962] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 3 on vga encoder (output 0)
> Jun  5 18:11:31 xander kernel: [85317.572244] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
> Jun  5 18:11:31 xander kernel: [85317.572251] [drm] nouveau 
> 0000:01:00.0: Output VGA-1 is running on CRTC 0 using output A
> Jun  5 18:17:01 xander /USR/SBIN/CRON[7767]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 18:54:21 xander kernel: [87887.751760] [drm] nouveau 
> 0000:01:00.0: Setting dpms mode 3 on vga encoder (output 0)
> Jun  5 19:17:01 xander /USR/SBIN/CRON[7873]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Jun  5 21:08:47 xander kernel: imklog 4.6.4, log source = /proc/kmsg 
> started.
> Jun  5 21:08:47 xander rsyslogd: [origin software="rsyslogd" 
> swVersion="4.6.4" x-pid="1259" x-info="http://www.rsyslog.com"] (re)start

Pretty uneventful too... 
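
One thing the logs do give you is a bound on the time of death: the
hourly cron entry works as a heartbeat, so the first missing one tells
you roughly when the box stopped. A rough Python sketch of that idea
follows. The function names and the 65-minute limit are my own choices,
it assumes an English locale for the month names, and it ignores year
rollover because classic syslog stamps carry no year.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line."""
    # The year is supplied by the caller because the log does not record it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, limit=timedelta(minutes=65)):
    """Return (last_entry, next_entry) pairs separated by more than limit."""
    gaps = []
    previous = None
    for line in lines:
        try:
            stamp = parse_stamp(line)
        except ValueError:
            continue  # wrapped continuation line or other non-entry
        if previous is not None and stamp - previous > limit:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Fed the two snippets above one at a time (with the "> " quoting
stripped), it should report 13:17:01 to 18:30:00 on Jun 4 and 19:17:01
to 21:08:47 on Jun 5. In both cases the next hourly cron entry never
appeared, so the system went down within the hour after the last line.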

> If there is anything here that points to a problem, please let me know.  
> If there is any other data that would help diagnose this, I will do my 
> best to provide it.

One experiment that may rule out the video driver: do NOT start X.
If the system still crashes without X ever having been started, that
points towards the problem being elsewhere...

Another thing that may help: If you have another system on the
network, configure (r)syslog to log to the remote system.  When a
system crashes it may not be able to write logs locally, but often
network logging still works. This may allow you to capture log entries
regarding the crash itself.
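
With rsyslog, forwarding is typically a line such as "*.* @loghost" in
the configuration on the crashing box ("loghost" is a placeholder for
your other machine; a single @ means UDP), and the receiving side needs
UDP reception enabled. Before relying on it, it is worth checking that
datagrams really arrive. A minimal Python sketch for that, using only
the standard library; the function names and the "crash-debug" logger
name are made up for illustration:

```python
import logging
import logging.handlers
import socket


def send_test_message(host, port=514, text="remote logging test"):
    """Send one syslog datagram over UDP to host:port."""
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger = logging.getLogger("crash-debug")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        logger.removeHandler(handler)
        handler.close()


def receive_one(sock, timeout=5.0):
    """Wait for one datagram on a bound UDP socket; return (text, sender_ip)."""
    sock.settimeout(timeout)
    data, sender = sock.recvfrom(4096)
    return data.decode("utf-8", "replace"), sender[0]
```

Run send_test_message() on the crashing box and look for the message in
the other machine's logs. If no syslog daemon is listening there yet,
bind a UDP socket on some unprivileged port of your choosing and call
receive_one() on it to test the network path alone.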

To diagnose a completely dead system: does Caps Lock still work (enough
to toggle the light)?

Does it respond to ping? If not, does it still answer ARP?
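
On Linux both ICMP echo replies and ARP replies come from the kernel,
so an answer to either suggests the kernel is still running even if
everything in userland is hung. From another Linux machine on the LAN
the check could be scripted roughly like this. It is only a sketch: the
function names are mine, it relies on the system "ping" command, and it
reads the ARP cache in the /proc/net/arp text format.

```python
import subprocess


def responds_to_ping(host, wait_seconds=2):
    """Return True if one ICMP echo request is answered (system ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(wait_seconds), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def arp_entry(arp_table_text, ip):
    """Return the MAC for ip from /proc/net/arp text, or None if unresolved."""
    for row in arp_table_text.splitlines()[1:]:  # first row is the header
        fields = row.split()
        if len(fields) >= 4 and fields[0] == ip:
            complete = int(fields[2], 16) & 0x2  # 0x2 marks a resolved entry
            return fields[3] if complete else None
    return None
```

For the box in the logs that would be responds_to_ping("192.168.1.2"),
followed by arp_entry(open("/proc/net/arp").read(), "192.168.1.2").
Bear in mind that an entry cached from before the crash can linger for
a while, so a resolved entry is only meaningful once the old one has
expired or been deleted.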

If it dies due to a kernel panic, you may be able to get it to
automagically reboot by adding "panic=60" to the kernel command line
(this makes it reboot 60 seconds after a kernel panic). Obviously this
will not solve the problem, but it may make the system less unusable
while the problem is still ongoing...
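
On a GRUB 2 setup the parameter usually goes into GRUB_CMDLINE_LINUX in
/etc/default/grub, followed by update-grub; if your box boots some
other way, adjust accordingly. After the next reboot you can confirm
that the option took effect by looking at /proc/cmdline. A small Python
sketch of that check (the function name is mine, and keeping the last
occurrence when the option appears more than once is my assumption):

```python
def panic_timeout(cmdline):
    """Return the panic= value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("panic="):
            try:
                value = int(token[len("panic="):])  # keep the last one seen
            except ValueError:
                pass  # ignore a malformed value
    return value
```

Call it as panic_timeout(open("/proc/cmdline").read()). The value the
running kernel actually uses is also visible in /proc/sys/kernel/panic.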

-- 
Karl E. Jorgensen
IT Operations

