
Re: kernel soft lockup (nagios related?)



Hello,

Just an update on this: I've disabled HyperThreading in the BIOS for
this machine, and this appears to have resolved the crashing.
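
For anyone wanting to confirm from the running system whether
HyperThreading is actually off, here is a minimal sketch. It assumes an
x86 Linux /proc/cpuinfo that exposes the "siblings" and "cpu cores"
fields; the function name is made up for illustration and is not
something from the original setup.

```python
def hyperthreading_active(cpuinfo_text):
    """Return True if any processor entry reports more siblings than cores.

    Assumes the x86 /proc/cpuinfo layout: one "key : value" pair per line,
    with a blank line between processor entries.
    """
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # A line without a colon (normally blank) ends a processor entry.
            siblings = cores = None
            continue
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        # More logical siblings than physical cores means SMT is in use.
        if siblings is not None and cores is not None and siblings > cores:
            return True
    return False
```

On the machine itself this would be called with the contents of
/proc/cpuinfo, e.g. hyperthreading_active(open("/proc/cpuinfo").read()).
With HyperThreading disabled in the BIOS, the two fields should be equal
and the function should return False.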

So it seems there is a bug here involving HyperThreading and the 686
kernel with PAE enabled. Might be worth trying to track this down a
bit further.
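
As a starting point for tracking it down, a rough sketch for pulling the
soft lockup reports out of a saved kernel log, so that the CPU numbers
and processes can be compared across crashes. The pattern is based only
on the message format quoted below, and it assumes each message sits on
a single line in the log file (it is wrapped in this mail). The function
and field names are illustrative.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#13 stuck for 61s! [nagios3:2070]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup report found in log_text."""
    events = []
    for match in LOCKUP_RE.finditer(log_text):
        events.append({
            "cpu": int(match.group("cpu")),          # CPU that was stuck
            "seconds": int(match.group("seconds")),  # how long it was stuck
            "comm": match.group("comm"),             # process name
            "pid": int(match.group("pid")),          # process ID
        })
    return events
```

Run over a whole log, this gives a quick view of whether the lockups
always land on the same CPUs or the same process.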

Cheers,
Blair


On Mon, Jan 23, 2012 at 9:42 AM, Blair Harrison <debian@jedi.school.nz> wrote:
> Hello,
>
> I've got a server which has now crashed a few times in a similar
> fashion. I even tried moving to new hardware, with similar effect
> (though on the new hardware this seems to be happening more
> frequently), so this seems likely to be some interaction between
> nagios and the kernel causing a soft lockup. Any ideas on how to
> resolve this would be appreciated. Unfortunately this is the only log
> I have of the event: the first event didn't produce any output like
> this, and I haven't got a record of the logs from the previous
> hardware, as I thought they may have been isolated incidents.
>
> The previous hardware was running Lenny rather than Squeeze, so this
> does not seem to be isolated to one version of anything in particular.
>
> Let me know if there is any more information which would be of use.
>
> There are quite a few bits of software running on here: RTG, Cricket,
> Nagios, smokeping, rancid.
>
> Debian 6.0.3
>
> Linux zzz-zzz 2.6.32-5-686-bigmem #1 SMP Wed Jan 11 13:17:56 UTC 2012
> i686 GNU/Linux
>
> Jan 22 22:40:40 zzz-zzz kernel: [176617.648985] BUG: soft lockup -
> CPU#13 stuck for 61s! [nagios3:2070]
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649040] Modules linked in:
> netconsole configfs joydev usbhid hid xt_multiport iptable_filter
> ip_tables x_tables 8021q garp stp loop snd_pcm snd_timer snd soundcore
> snd_page_alloc ioatdma pcspkr evdev cdc_ether usbnet button processor
> serio_raw dca mii shpchp pci_hotplug i2c_i801 i2c_core ext4 mbcache
> jbd2 crc16 raid10 md_mod sd_mod crc_t10dif ata_generic uhci_hcd
> megaraid_sas ata_piix ehci_hcd libata usbcore scsi_mod nls_base
> thermal bnx2 thermal_sys [last unloaded: netconsole]
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649078]
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649082] Pid: 2070, comm:
> nagios3 Not tainted (2.6.32-5-686-bigmem #1) System x3550 M3
> -[7944D2M]-
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649085] EIP: 0060:[<c10249bb>]
> EFLAGS: 00000202 CPU: 13
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649094] EIP is at
> native_flush_tlb_others+0x85/0xa6
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649096] EAX: 00000282 EBX:
> c14661ac ECX: c10200d8 EDX: 00000020
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649099] ESI: 00000005 EDI:
> 00000140 EBP: c14661a0 ESP: ee4c9a3c
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649101]  DS: 007b ES: 007b FS:
> 00d8 GS: 00e0 SS: 0068
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649104] CR0: 8005003b CR2:
> b758a376 CR3: 2eb7e000 CR4: 000006f0
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649106] DR0: 00000000 DR1:
> 00000000 DR2: 00000000 DR3: 00000000
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649108] DR6: ffff0ff0 DR7: 00000400
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649110] Call Trace:
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649116]  [<c1024aa3>] ?
> flush_tlb_page+0x5d/0x65
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649120]  [<c1023e90>] ?
> ptep_set_access_flags+0x59/0x63
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649125]  [<c10a1040>] ?
> do_wp_page+0x3b9/0x7dd
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649131]  [<c1031770>] ?
> finish_task_switch+0x76/0x95
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649135]  [<c10b61a0>] ?
> kmem_cache_free+0x78/0xaf
> Jan 22 22:40:40 zzz-zzz kernel: [176617.649138]  [<c1031770>] ?
> finish_task_switch+0x76/0x95
> Jan 22 22:40:40 zzz-zzz kernel: [1766
> Jan 23 07:13:24 zzz-zzz syslog-ng[1807]: syslog-ng starting up;
> version='3.1.3'
>
> Cheers,
> Blair

