
Bug#640941: xen dom0 crash: unable to handle kernel paging request / Oops / Kernel panic



Hi Ian,

On 09/11/2011 05:37 PM, Ian Campbell wrote:
On Sun, 2011-09-11 at 01:18 +0200, Hans van Kranenburg wrote:
On 09/08/2011 07:12 PM, Hans van Kranenburg wrote:

When putting disk/network load on one of our office servers, Xen/dom0
crashes. The serial console no longer responds to triple Ctrl-A.

[...]
A workaround should be to set the option disable_sendpage when loading
the drbd module [3], [4].

Sounds similar to the NFS issue[0] which caused me to begin working on
the SKB paged fragment destructor patches[1]. I just gave a talk about
this problem at LPC last week[2].

They are still a WIP but I hope to have them ready for Linux 3.3. I will
include DRBD in my list of subsystems to consider.

That sure is very interesting material to read. The original stack trace I posted makes a lot more sense to me now.

In the meantime disabling sendpage sounds like the best workaround.

So we set the disable_sendpage option and rebooted the domU with a drbdadm down/up of the drbd devices (just to be sure, as I don't know where or when drbd reads this option). After some days of hitting the disks and the network with data, no more kernel panics have occurred. Yay!
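
For anyone who wants to check whether the option is actually active, here is a minimal Python sketch. It assumes that the drbd module exposes the parameter through sysfs at /sys/module/drbd/parameters/disable_sendpage and that the value reads as Y/N or 1/0; both the path and the helper name sendpage_disabled are my assumptions and are not taken from this thread, so treat it as an illustration rather than a verified procedure.

```python
from pathlib import Path

# Assumed sysfs location of the drbd module parameter (not confirmed in this thread).
DEFAULT_PARAM = Path("/sys/module/drbd/parameters/disable_sendpage")


def sendpage_disabled(param_path=DEFAULT_PARAM):
    """Return True or False for the parameter value, or None if it cannot be determined."""
    try:
        raw = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, or the parameter is not exposed at this path.
        return None
    if raw in ("Y", "y", "1"):
        return True
    if raw in ("N", "n", "0"):
        return False
    # Unexpected content: report "unknown" rather than guess.
    return None


if __name__ == "__main__":
    state = sendpage_disabled()
    if state is None:
        print("disable_sendpage: unknown (module not loaded or parameter not exposed)")
    else:
        print("disable_sendpage:", "enabled" if state else "not enabled")
```

Running it in the domain where the drbd module is loaded prints the current state; a result of "unknown" only means the assumed path could not be read, not that the workaround is missing.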

In the post you reference with [1] you write: "I expect that other block and filesystem users of the network subsystem (e.g. iSCSI) would also benefit from this functionality since they will suffer from the same class of issue." Part of my work in the near future is doing lenny->squeeze upgrades of a couple of systems where we use LVM-backed block devices for domUs, which sit on dm-multipath on iSCSI. Should I be concerned that the same issues can occur when using iSCSI on squeeze? If so, or if unknown, do you recommend specific (stress) tests that we can run in the test-upgrade environment?

[1] http://marc.info/?l=linux-netdev&m=131072801125521&w=2

What should be done with this bug report? Should I close it, since there is a workaround and no simple fix that can be done in squeeze, or should it stay open until the work on this is done and included in the kernel?

Thanks!

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenburg@mendix.com | www.mendix.com


