
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs



Hi Ian,

On 05/04/12 21:04, Ian Campbell wrote:
> On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
> Those issues were believed to be fixed in 2.6.32-34 and you are running
> 2.6.32-39 so either this is a different issue (perhaps with similar
> symptoms) or the issue isn't really fixed. Either way I think we need to
> see your kernel logs containing the actual oops in order to make any
> progress.
Yes, we have been having this problem since before 2.6.32-34 and were
very hopeful that change would fix it. This sadly was not the case.
Unfortunately there isn't anything in the logs for this, but I have a
screenshot from the console, which I have attached.
> Thanks.
>
> Googling around for issues with sync_super threw up
> https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
> https://bugzilla.redhat.com/show_bug.cgi?id=550724. Comment 81 of the
> second one mentioned issues with IRQ handling, which reminded me that a
> bunch of those were fixed in 2.6.32-40 whereas you are running -39 (which
> is fair enough since that is the version currently in stable). Could you
> try the kernel from stable-proposed-updates (now 2.6.32-43)?
>
> Also referenced was https://lkml.org/lkml/2010/9/1/178 which supports
> the interrupt problem theory.

Thanks for that; I think you could be on the right track here.

On another dom0, which crashed over the weekend, I observed the following behaviour at least six times while doing a RAID re-sync. I am unsure whether this is related at all, but disk utilisation was low and the server remained responsive at the time.

Apr 10 03:30:47 dom0 kernel: [261639.807061] INFO: task umount:18216 blocked for more than 120 seconds.
Apr 10 03:30:47 dom0 kernel: [261639.807098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 03:30:47 dom0 kernel: [261639.807146] umount D 0000000000000002 0 18216 18214 0x00000000
Apr 10 03:30:47 dom0 kernel: [261639.807152] ffff88003d44cdb0 0000000000000286 000000043ca5fd08 ffff88003ca5fd98
Apr 10 03:30:47 dom0 kernel: [261639.807157] 0000000000000000 0000000000000000 000000000000f9e0 ffff88003ca5ffd8
Apr 10 03:30:47 dom0 kernel: [261639.807161] 0000000000015780 0000000000015780 ffff88003d44a350 ffff88003d44a648
Apr 10 03:30:47 dom0 kernel: [261639.807165] Call Trace:
Apr 10 03:30:47 dom0 kernel: [261639.807177] [<ffffffff8100e635>] ? xen_force_evtchn_callback+0x9/0xa
Apr 10 03:30:47 dom0 kernel: [261639.807180] [<ffffffff8100ecf2>] ? check_events+0x12/0x20
Apr 10 03:30:47 dom0 kernel: [261639.807186] [<ffffffff81040e42>] ? check_preempt_wakeup+0x0/0x268
Apr 10 03:30:47 dom0 kernel: [261639.807191] [<ffffffff81109647>] ? bdi_sched_wait+0x0/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807194] [<ffffffff81109650>] ? bdi_sched_wait+0x9/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807201] [<ffffffff8130deda>] ? _spin_unlock_irqrestore+0xd/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807205] [<ffffffff8130d127>] ? __wait_on_bit+0x41/0x70
Apr 10 03:30:47 dom0 kernel: [261639.807208] [<ffffffff81040e42>] ? check_preempt_wakeup+0x0/0x268
Apr 10 03:30:47 dom0 kernel: [261639.807211] [<ffffffff81109647>] ? bdi_sched_wait+0x0/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807214] [<ffffffff8130d1c1>] ? out_of_line_wait_on_bit+0x6b/0x77
Apr 10 03:30:47 dom0 kernel: [261639.807219] [<ffffffff81066048>] ? wake_bit_function+0x0/0x23
Apr 10 03:30:47 dom0 kernel: [261639.807222] [<ffffffff811096c8>] ? sync_inodes_sb+0x73/0x12a
Apr 10 03:30:47 dom0 kernel: [261639.807227] [<ffffffff8110d27d>] ? __sync_filesystem+0x4b/0x70
Apr 10 03:30:47 dom0 kernel: [261639.807239] [<ffffffff810f1e5e>] ? generic_shutdown_super+0x21/0xfa
Apr 10 03:30:47 dom0 kernel: [261639.807242] [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
Apr 10 03:30:47 dom0 kernel: [261639.807245] [<ffffffff810f1f59>] ? kill_block_super+0x22/0x3a
Apr 10 03:30:47 dom0 kernel: [261639.807249] [<ffffffff810f2629>] ? deactivate_super+0x60/0x77
Apr 10 03:30:47 dom0 kernel: [261639.807254] [<ffffffff81104f9c>] ? sys_umount+0x2dc/0x30b
Apr 10 03:30:47 dom0 kernel: [261639.807257] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
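
One thought: since these tasks only block rather than oops outright, we could make the next occurrence panic the box so the full machine state lands on the console. A rough sketch, assuming the Debian kernel has CONFIG_DETECT_HUNG_TASK enabled (the "echo 0 > ..." message above suggests it does):

    # Make the hung-task detector panic instead of just logging, so the
    # complete state is captured on the (serial) console when it fires.
    echo 1 > /proc/sys/kernel/hung_task_panic

    # or persistently, via /etc/sysctl.conf:
    #   kernel.hung_task_panic = 1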

When reviewing our logs across multiple crashes affecting multiple physical servers, I note that the last log entries before the oops consistently show one process unmounting a snapshot while another process is taking a new snapshot; a sketch of how we could serialise these two jobs follows below.
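
In case the overlap itself is the trigger, we are considering serialising the jobs. A minimal sketch using flock(1), where the VG/LV names, size, and mount point are all made up for illustration:

    # Hold an exclusive lock across the whole snapshot lifecycle so that
    # lvcreate can never run while a previous snapshot is still being
    # unmounted and removed. Names below are hypothetical.
    (
      flock -x 9
      lvcreate -s -L 10G -n backup-snap /dev/vg0/guest-disk
      mount -o ro /dev/vg0/backup-snap /mnt/backup-snap
      # ... run the backup here ...
      umount /mnt/backup-snap
      lvremove -f /dev/vg0/backup-snap
    ) 9>/var/lock/lvm-backup.lock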

It will take us a few days (probably a week or so) to get this new kernel rolled out, but I will post an update here on how it changes things once I know more.
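
For anyone else following this bug, the rollout will be roughly the following; the archive line and package name are my assumptions for a squeeze amd64 dom0 and may differ per machine:

    # Pull the 2.6.32-43 build from stable-proposed-updates.
    echo 'deb http://ftp.debian.org/debian squeeze-proposed-updates main' \
        >> /etc/apt/sources.list
    apt-get update
    apt-get install linux-image-2.6.32-5-xen-amd64
    reboot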

> If there's any chance of setting up a serial console to catch this issue
> should it happen again then that would be very useful too.

We will also be looking into setting this up, as the issue has been happening fairly frequently for us.
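
For reference, what I have in mind is roughly the following, assuming GRUB 2 on the dom0; the port and baud rate are placeholders for whatever the hardware actually has:

    # /etc/default/grub on the dom0: Xen takes the physical serial port,
    # and dom0 logs via the Xen console (hvc0, or xvc0 on some xenified
    # 2.6.32 kernels).
    GRUB_CMDLINE_XEN="com1=115200,8n1 console=com1,vga"
    GRUB_CMDLINE_LINUX="console=hvc0 console=tty0"

    # then regenerate the config and reboot:
    #   update-grub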

Thanks for your help so far. :-)

Best Regards,

Quintin


