
Bug#592187: Bug#576838: virtio network crashes again

This is not the same bug as the one originally reported, which was that
virtio_net failed to retry refilling its RX buffer ring.  That bug is
definitely fixed, so I'm treating this as a new bug report, #592187.

On Sat, 2010-08-07 at 18:17 +0200, Lukas Kolbe wrote:
> On Saturday, 2010-08-07 at 12:18 +0100, Ben Hutchings wrote:
> > On Sat, 2010-08-07 at 11:21 +0200, Lukas Kolbe wrote:
> > > Hi,
> > > 
> > > I sent this earlier today but the bug was archived so it didn't appear
> > > anywhere, hence the resend.
> > > 
> > > I believe this issue is not fixed at all in 2.6.32-18. We have seen this
> > > behaviour in various kvm guests using virtio_net with the same kernel in
> > > the guest only minutes after starting the nightly backup (rdiff-backup
> > > to an nfs-volume on a remote server), eventually leading to a
> > > non-functional network. Often, the machines even do not reboot and hang
> > > instead. Using the rtl8139 instead of virtio helps, but that's really
> > > only a clumsy workaround.
> > [...]
> > 
> > I think you need to give your guests more memory.
> They all have between 512M and 2G - and it happens to all of them using
> virtio_net, and none of them using rtl8139 as a network driver,
> reproducibly.

The RTL8139 hardware uses a single fixed RX DMA buffer.  The virtio
'hardware' allows the host to write into RX buffers anywhere in guest
memory.  This results in very different allocation patterns.

Please try specifying 'e1000' hardware, i.e. an Intel gigabit
controller.  I think the e1000 driver will have a similar allocation
pattern to virtio_net, so you can see whether it also triggers
allocation failures and a network stall in the guest.

Also, please test Linux 2.6.35 in the guest.  This is packaged in the
'experimental' suite.

> If it would be an OOM situation, wouldn't the OOM-killer be supposed to
> kick in?

The log you sent shows failure to allocate memory in an 'atomic' context
where there is no opportunity to wait for pages to be swapped out.  The
OOM killer isn't triggered until the system is running out of memory
despite swapping out pages.
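
To illustrate the distinction, here is a toy model in Python.  It is
only a sketch and not kernel code: the class ToyMemory, its methods and
its counters are hypothetical names invented for this example, and real
page allocation is far more involved.  It shows only that an allocation
which must not wait can fail while reclaimable memory remains, without
the OOM killer being involved.

```python
class ToyMemory:
    """Toy page pool (hypothetical, for illustration only).

    `free` pages are usable immediately; `reclaimable` pages could be
    freed by waiting, e.g. for pages to be swapped out.
    """

    def __init__(self, free, reclaimable):
        self.free = free
        self.reclaimable = reclaimable
        self.oom_kills = 0

    def alloc_atomic(self):
        """Must not wait: succeeds only if a page is free right now."""
        if self.free > 0:
            self.free -= 1
            return True
        # Failure is reported to the caller; the OOM killer is not involved.
        return False

    def alloc_blocking(self):
        """May wait: reclaims a page first if none is free."""
        if self.free == 0 and self.reclaimable > 0:
            self.reclaimable -= 1  # stands in for waiting on swap-out
            self.free += 1
        if self.free == 0:
            # Nothing free and nothing left to reclaim: in this toy model
            # this is the only point at which the OOM killer would act.
            self.oom_kills += 1
            return False
        self.free -= 1
        return True


mem = ToyMemory(free=0, reclaimable=2)
atomic_ok = mem.alloc_atomic()      # fails although memory is reclaimable
blocking_ok = mem.alloc_blocking()  # succeeds after reclaiming a page
```

In this model the atomic request fails with oom_kills still at zero,
which matches the pattern in the log: allocation failures in the guest
but no OOM killer activity.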

Also, I note that following the failure of virtio_net to refill its RX
buffer ring, I see failures to allocate buffers for sending TCP ACKs.
So the guest drops the ACKs, and that TCP connection will stall
temporarily (until the peer re-sends the unacknowledged packets).

I also see 'nfs: server fileserver.backup.TechFak.Uni-Bielefeld.DE not
responding, still trying'.  This suggests that the allocation failure in
virtio_net has resulted in dropping packets from the NFS server.  That
makes matters worse, because it becomes impossible to free memory by
flushing out buffers over NFS!


Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
