
Bug#592187: Bug#576838: virtio network crashes again



Hi Ben,

On Sunday, 08.08.2010, at 03:36 +0100, Ben Hutchings wrote:
> This is not the same bug as was originally reported, which is that
> virtio_net failed to retry refilling its RX buffer ring.  That is
> definitely fixed.  So I'm treating this as a new bug report, #592187.

Okay, thanks. 

> > > I think you need to give your guests more memory.
> > 
> > They all have between 512M and 2G - and it happens to all of them using
> > virtio_net, and none of them using rtl8139 as a network driver,
> > reproducibly.
> 
> The RTL8139 hardware uses a single fixed RX DMA buffer.  The virtio
> 'hardware' allows the host to write into RX buffers anywhere in guest
> memory.  This results in very different allocation patterns.
> 
> Please try specifying 'e1000' hardware, i.e. an Intel gigabit
> controller.  I think the e1000 driver will have a similar allocation
> pattern to virtio_net, so you can see whether it also triggers
> allocation failures and a network stall in the guest.
> 
> Also, please test Linux 2.6.35 in the guest.  This is packaged in the
> 'experimental' suite.

I'll rig up a test machine (the crashes all occurred on production
guests, unfortunately) and report back.

> [...]
> > If it would be an OOM situation, wouldn't the OOM-killer be supposed to
> > kick in?
> [...]
> 
> The log you sent shows failure to allocate memory in an 'atomic' context
> where there is no opportunity to wait for pages to be swapped out.  The
> OOM killer isn't triggered until the system is running out of memory
> despite swapping out pages.

Ah, good to know, thanks!

> Also, I note that following the failure of virtio_net to refill its RX
> buffer ring, I see failures to allocate buffers for sending TCP ACKs.
> So the guest drops the ACKs, and that TCP connection will stall
> temporarily (until the peer re-sends the unacknowledged packets).
> 
> I also see 'nfs: server fileserver.backup.TechFak.Uni-Bielefeld.DE not
> responding, still trying'.  This suggests that the allocation failure in
> virtio_net has resulted in dropping packets from the NFS server.  And it
> just makes matters worse as it becomes impossible to free memory by
> flushing out buffers over NFS!

This sounds quite bad. 

This problem *seems* to be fixed by 2.6.32-19: we upgraded to that on a
different machine for host and guests, and an rsync of ~1TiB of data
didn't produce any page allocation failures using virtio. But I'd wait
for my tests with rsync/nfs and 2.6.32-18+e1000, 2.6.32-18+virtio,
2.6.32-19+virtio and 2.6.35+virtio before concluding that.

Thanks for taking the time to explain things!

-- 
Lukas




