This is not the same bug as was originally reported, which was that
virtio_net failed to retry refilling its RX buffer ring.  That is
definitely fixed.  So I'm treating this as a new bug report, #592187.

On Sat, 2010-08-07 at 18:17 +0200, Lukas Kolbe wrote:
> On Saturday, 07.08.2010 at 12:18 +0100, Ben Hutchings wrote:
> > On Sat, 2010-08-07 at 11:21 +0200, Lukas Kolbe wrote:
> > > Hi,
> > >
> > > I sent this earlier today but the bug was archived so it didn't appear
> > > anywhere, hence the resend.
> > >
> > > I believe this issue is not fixed at all in 2.6.32-18. We have seen this
> > > behaviour in various kvm guests using virtio_net with the same kernel in
> > > the guest only minutes after starting the nightly backup (rdiff-backup
> > > to an nfs-volume on a remote server), eventually leading to a
> > > non-functional network. Often, the machines even do not reboot and hang
> > > instead. Using the rtl8139 instead of virtio helps, but that's really
> > > only a clumsy workaround.
> > [...]
> >
> > I think you need to give your guests more memory.
>
> They all have between 512M and 2G - and it happens to all of them using
> virtio_net, and none of them using rtl8139 as a network driver,
> reproducibly.

The RTL8139 hardware uses a single fixed RX DMA buffer.  The virtio
'hardware' allows the host to write into RX buffers anywhere in guest
memory.  This results in very different allocation patterns.

Please try specifying 'e1000' hardware, i.e. an Intel gigabit
controller.  I think the e1000 driver will have a similar allocation
pattern to virtio_net, so you can see whether it also triggers
allocation failures and a network stall in the guest.

Also, please test Linux 2.6.35 in the guest.  This is packaged in the
'experimental' suite.

[...]
> If it would be an OOM situation, wouldn't the OOM-killer be supposed to
> kick in?
[...]

The log you sent shows failure to allocate memory in an 'atomic' context
where there is no opportunity to wait for pages to be swapped out.  The
OOM killer isn't triggered until the system is running out of memory
despite swapping out pages.

Also, I note that following the failure of virtio_net to refill its RX
buffer ring, I see failures to allocate buffers for sending TCP ACKs.
So the guest drops the ACKs, and that TCP connection will stall
temporarily (until the peer re-sends the unacknowledged packets).

I also see 'nfs: server fileserver.backup.TechFak.Uni-Bielefeld.DE not
responding, still trying'.  This suggests that the allocation failure in
virtio_net has resulted in dropping packets from the NFS server.  And
that just makes matters worse, as it becomes impossible to free memory by
flushing out buffers over NFS!

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
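
As a rough illustration of the point about atomic allocations above: such an allocation can only be satisfied from pages that are free right now, not from page cache that would first have to be written out or swapped. The following is a minimal sketch, not part of the original report, that reads /proc/meminfo in the guest and prints immediately free memory next to memory held in buffers and cache. The helper names parse_meminfo, summarize and main are made up for this example, and the comparison is only a crude indicator, not what the kernel allocator actually computes.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB; a few (such as the HugePages
    counters) are plain counts. Lines that do not fit are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Return (free_kb, cached_kb).

    free_kb is memory that is free right now; cached_kb is memory in
    buffers and page cache, which cannot be waited for in an atomic
    context.
    """
    free_kb = fields.get("MemFree", 0)
    cached_kb = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return free_kb, cached_kb


def main(path="/proc/meminfo"):
    # Only meaningful on Linux; do nothing if the file is absent.
    if not os.path.exists(path):
        return
    with open(path) as f:
        free_kb, cached_kb = summarize(parse_meminfo(f.read()))
    print("free now: %d kB, buffers+cache: %d kB" % (free_kb, cached_kb))


if __name__ == "__main__":
    main()
```

Running this in a guest while the backup is in progress would be expected to show a small "free now" figure next to a large "buffers+cache" figure if the guest is in the situation described above, though that is an assumption rather than something taken from the log.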