
repeated kernel crashes with PCI passthru



Hi all,

I recently upgraded one of my Debian servers
from XEN 3.2 / Kernel 2.6.26
to XEN 4.0 / Kernel 2.6.32.

I have configured PCI passthru for a NIC.

Since the current Debian pvops kernel does not have the xen pci frontend
driver required for PCI passthru, I am running a XEN kernel in both dom0
and domU, so actual kernel versions are:

dom0:  2.6.32-5-xen-amd64 #1 SMP Tue Jun 1
domU: 2.6.32-5-xen-686 #1 SMP Tue Jul 6
the hypervisor is 4.0.1-rc3

(Random notes:
 1. the dom0 is 64bit, this domU is 32bit.
 2. The dom0 kernel is not the latest (-16), but the one before (-15),
because the current one won't boot up, see #588509 and #588426.
)

   * * *

So, the system boots up as it should, but sometimes the domU crashes, with messages like these:

---------------------

[27047.101954] BUG: unable to handle kernel paging request at 00d90200
[27047.101979] IP: [<c11f01aa>] skb_release_data+0x71/0x90
[27047.102000] *pdpt = 0000000001c21027 *pde = 0000000000000000 
[27047.102019] Thread overran stack, or stack corrupted
[27047.102031] Oops: 0000 [#1] SMP 
[27047.102047] last sysfs file: /sys/devices/virtual/net/ppp0/uevent
[27047.102060] Modules linked in: tun xt_limit nf_nat_irc nf_nat_ftp ipt_LOG ipt_MASQUERADE xt_DSCP ipt_REJECT nf_conntrack_irc nf_conntrack_ftp xt_state xt_TCPMSS xt_tcpmss xt_tcpudp pppoe pppox ppp_generic slhc sundance mii iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables x_tables dm_snapshot dm_mirror dm_region_hash dm_log dm_mod loop evdev snd_pcsp snd_pcm snd_timer snd xen_netfront soundcore snd_page_alloc ext3 jbd mbcache thermal_sys xen_blkfront
[27047.102275] 
[27047.102285] Pid: 0, comm: swapper Not tainted (2.6.32-5-xen-686 #1) 
[27047.102298] EIP: 0061:[<c11f01aa>] EFLAGS: 00010206 CPU: 0
[27047.102310] EIP is at skb_release_data+0x71/0x90
[27047.102321] EAX: 00d90200 EBX: 00000000 ECX: c2939c10 EDX: cec6b500
[27047.102333] ESI: cf8f0a80 EDI: cf8f09c0 EBP: c13919c8 ESP: c1383eec
[27047.102346]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[27047.102358] Process swapper (pid: 0, ti=c1382000 task=c13c2ba0 task.ti=c13820
[27047.102371] Stack:
[27047.102379]  cf8f0a80 c293a700 c11efdfb cf8f09c0 c11f4c35 00000011 c1380000 00000002
[27047.102415] <0> 00000008 c13919c8 c103c1ec c14594b0 00000001 0000000a 00000000 00000100
[27047.102455] <0> c1380000 00000000 c13c5d18 00000000 c103c2c4 00000000 c1383f5c c103c39a
[27047.102499] Call Trace:
[27047.102512]  [<c11efdfb>] ? __kfree_skb+0xf/0x6e
[27047.102527]  [<c11f4c35>] ? net_tx_action+0x58/0xf9
[27047.102542]  [<c103c1ec>] ? __do_softirq+0xaa/0x151
[27047.102557]  [<c103c2c4>] ? do_softirq+0x31/0x3c
[27047.102570]  [<c103c39a>] ? irq_exit+0x26/0x58
[27047.102586]  [<c1198a46>] ? xen_evtchn_do_upcall+0x22/0x2c
[27047.102604]  [<c1009b7f>] ? xen_do_upcall+0x7/0xc
[27047.102630]  [<c10023a7>] ? hypercall_page+0x3a7/0x1001
[27047.102647]  [<c1006169>] ? xen_safe_halt+0xf/0x1b
[27047.102661]  [<c10042bf>] ? xen_idle+0x23/0x30
[27047.102676]  [<c1008168>] ? cpu_idle+0x89/0xa5
[27047.102691]  [<c13fb80d>] ? start_kernel+0x318/0x31d
[27047.102706]  [<c13fd3c3>] ? xen_start_kernel+0x615/0x61c
[27047.102721]  [<c1409045>] ? print_local_APIC+0x61/0x380
[27047.102732] Code: 8b 44 02 30 e8 9a 4f ea ff 8b 96 a4 00 00 00 0f b7 42 04 39 c3 7c e5 8b 96 a4 00 00 00 8b 42 1c 85 c0 74 16 c7 42 1c 00 00 00 00 <8b> 18 e8 d2 fc ff ff 85 db 74 04 89 d8 eb f1 8b 86 a8 00 00 00 
[27047.102981] EIP: [<c11f01aa>] skb_release_data+0x71/0x90 SS:ESP 0069:c1383eec
[27047.103003] CR2: 0000000000d90200
[27047.103018] ---[ end trace a577dfc0e629cd07 ]---
[27047.103028] Kernel panic - not syncing: Fatal exception in interrupt
[27047.103042] Pid: 0, comm: swapper Tainted: G      D    2.6.32-5-xen-686 #1
[27047.103053] Call Trace:
[27047.103065]  [<c128ae0d>] ? panic+0x38/0xe4
[27047.103078]  [<c128d419>] ? oops_end+0x91/0x9d
[27047.103092]  [<c1021b5a>] ? no_context+0x134/0x13d
[27047.103106]  [<c1021c78>] ? __bad_area_nosemaphore+0x115/0x11d
[27047.103121]  [<c10067f0>] ? check_events+0x8/0xc
[27047.103135]  [<c10067e7>] ? xen_restore_fl_direct_end+0x0/0x1
[27047.103155]  [<d0823fdb>] ? xennet_poll+0xaeb/0xb04 [xen_netfront]
[27047.103170]  [<c10211df>] ? pvclock_clocksource_read+0xf9/0x10f
[27047.103185]  [<c10060e8>] ? xen_force_evtchn_callback+0xc/0x10
[27047.103200]  [<c114a00f>] ? xen_swiotlb_unmap_page+0x0/0x7
[27047.103214]  [<c10067f0>] ? check_events+0x8/0xc
[27047.103227]  [<c10060e8>] ? xen_force_evtchn_callback+0xc/0x10
[27047.103242]  [<c128e3f4>] ? do_page_fault+0x115/0x307
[27047.103255]  [<c128e2df>] ? do_page_fault+0x0/0x307
[27047.103268]  [<c1021c8a>] ? bad_area_nosemaphore+0xa/0xc
[27047.103282]  [<c128cb0b>] ? error_code+0x73/0x78
[27047.103295]  [<c11f01aa>] ? skb_release_data+0x71/0x90
[27047.103308]  [<c11efdfb>] ? __kfree_skb+0xf/0x6e
[27047.103321]  [<c11f4c35>] ? net_tx_action+0x58/0xf9
[27047.103335]  [<c103c1ec>] ? __do_softirq+0xaa/0x151
[27047.103348]  [<c103c2c4>] ? do_softirq+0x31/0x3c
[27047.103361]  [<c103c39a>] ? irq_exit+0x26/0x58
[27047.103374]  [<c1198a46>] ? xen_evtchn_do_upcall+0x22/0x2c
[27047.103388]  [<c1009b7f>] ? xen_do_upcall+0x7/0xc
[27047.103401]  [<c10023a7>] ? hypercall_page+0x3a7/0x1001
[27047.103415]  [<c1006169>] ? xen_safe_halt+0xf/0x1b
[27047.103428]  [<c10042bf>] ? xen_idle+0x23/0x30
[27047.103440]  [<c1008168>] ? cpu_idle+0x89/0xa5
[27047.103454]  [<c13fb80d>] ? start_kernel+0x318/0x31d
[27047.103467]  [<c13fd3c3>] ? xen_start_kernel+0x615/0x61c
[27047.103481]  [<c1409045>] ? print_local_APIC+0x61/0x380
------------------------------------------------------------------------------------

Then, since the IRQ of the card is shared with the SATA controller,
this basically kills the whole host, requiring a HW reset.

(Sometimes this second problem also occurs when I am rebooting the domU normally;
see http://lists.xensource.com/archives/html/xen-devel/2009-07/msg00224.html
for the thread about the shared IRQ problem. )
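
For anyone who wants to check whether a device shares its IRQ line with
another one, here is a minimal sketch that scans the text of /proc/interrupts
for lines listing more than one device. It assumes the 2.6.32-era layout
(IRQ number, one counter per CPU, one chip/type token, then a comma-separated
device list). The function name and the sample text are invented for
illustration and were not captured from the machine described above.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [device, ...]} for IRQ lines that list more than one device."""
    lines = interrupts_text.strip("\n").splitlines()
    ncpus = len(lines[0].split())        # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():          # skip NMI, LOC, ERR and similar rows
            continue
        tail = rest.split()[ncpus:]      # drop the per-CPU counters
        if len(tail) < 2:
            continue
        # Assumed layout: tail[0] is the chip/type, the rest are device names.
        devices = [d.strip() for d in " ".join(tail[1:]).split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


# Invented sample, not real output from the affected host.
SAMPLE = """\
           CPU0
 16:      12345   xen-pirq-ioapic-level   ahci, eth1
 17:        678   xen-pirq-ioapic-level   ehci_hcd:usb1
"""

if __name__ == "__main__":
    print(shared_irqs(SAMPLE))
```

On a live system the input would be the contents of /proc/interrupts, for
example shared_irqs(open("/proc/interrupts").read()). Newer kernels add an
extra column before the device names, so the parsing would need adjusting
there.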

This happens every few days, sometimes every few hours, making
the whole system basically unusable.

   * * *

Does anybody have any idea what could be happening here? How can I fix this?

Thank you for your help,

    Kristof Csillag


