Bug#592187: Bug#576838: virtio network crashes again
Hi Ben,
On Sunday, 08.08.2010, at 03:36 +0100, Ben Hutchings wrote:
> This is not the same bug as was originally reported, which is that
> virtio_net failed to retry refilling its RX buffer ring. That is
> definitely fixed. So I'm treating this as a new bug report, #592187.
Okay, thanks.
> > > I think you need to give your guests more memory.
> >
> > They all have between 512M and 2G - and it happens to all of them using
> > virtio_net, and none of them using rtl8139 as a network driver,
> > reproducibly.
>
> The RTL8139 hardware uses a single fixed RX DMA buffer. The virtio
> 'hardware' allows the host to write into RX buffers anywhere in guest
> memory. This results in very different allocation patterns.
>
> Please try specifying 'e1000' hardware, i.e. an Intel gigabit
> controller. I think the e1000 driver will have a similar allocation
> pattern to virtio_net, so you can see whether it also triggers
> allocation failures and a network stall in the guest.
>
> Also, please test Linux 2.6.35 in the guest. This is packaged in the
> 'experimental' suite.
I'll rig up a test machine (the crashes all occurred on production
guests, unfortunately) and report back.
> [...]
> > If it were an OOM situation, wouldn't the OOM killer be supposed to
> > kick in?
> [...]
>
> The log you sent shows failure to allocate memory in an 'atomic' context
> where there is no opportunity to wait for pages to be swapped out. The
> OOM killer isn't triggered until the system is running out of memory
> despite swapping out pages.
Ah, good to know, thanks!
> Also, I note that following the failure of virtio_net to refill its RX
> buffer ring, I see failures to allocate buffers for sending TCP ACKs.
> So the guest drops the ACKs, and that TCP connection will stall
> temporarily (until the peer re-sends the unacknowledged packets).
>
> I also see 'nfs: server fileserver.backup.TechFak.Uni-Bielefeld.DE not
> responding, still trying'. This suggests that the allocation failure in
> virtio_net has resulted in dropping packets from the NFS server. And it
> just makes matters worse as it becomes impossible to free memory by
> flushing out buffers over NFS!
This sounds quite bad.
This problem *seems* to be fixed by 2.6.32-19: we upgraded both host and
guests on a different machine to that version, and an rsync of ~1TiB of
data didn't produce any page allocation failures using virtio. But I'd
wait for the results of my rsync/NFS tests with 2.6.32-18+e1000,
2.6.32-18+virtio, 2.6.32-19+virtio and 2.6.35+virtio before concluding that.
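To compare those runs, something like the following could tally the failures per configuration. This is only a minimal sketch: it assumes the kernel log reports each failure on a line containing the phrase "page allocation failure", and the function name and the sample log lines are made up for illustration.

```python
import re

# Assumption: each failure shows up as a kernel log line containing
# "page allocation failure". The sample text below is invented.
PATTERN = re.compile(r"page allocation failure", re.IGNORECASE)


def count_allocation_failures(log_text):
    """Return the number of log lines that mention a page allocation failure."""
    return sum(1 for line in log_text.splitlines() if PATTERN.search(line))


sample_log = """\
kernel: eth0: link up
kernel: rsync: page allocation failure. order:0, mode:0x20
kernel: nfs: server not responding, still trying
kernel: swapper: page allocation failure. order:0, mode:0x20
"""

print(count_allocation_failures(sample_log))
```

Running it once per guest log (one per kernel/driver combination) would give a count for each test that can be compared directly.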
Thanks for taking the time to explain things!
--
Lukas