Bug#592187: Bug#576838: virtio network crashes again

To: Ben Hutchings <ben@decadent.org.uk>
Cc: 592187@bugs.debian.org
Subject: Bug#592187: Bug#576838: virtio network crashes again
From: Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de>
Date: Mon, 09 Aug 2010 09:20:17 +0200
Message-id: <1281338417.11319.20.camel@larosa.fritz.box>
Reply-to: Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de>, 592187@bugs.debian.org
In-reply-to: <1281234965.7543.128.camel@localhost>
References: <1281172902.7018.49.camel@larosa.fritz.box> <1281179915.7543.12.camel@localhost> <1281197867.11319.6.camel@larosa.fritz.box> <1281234965.7543.128.camel@localhost>

Hi Ben,

On Sunday, 08.08.2010 at 03:36 +0100, Ben Hutchings wrote:
> This is not the same bug as was originally reported, which is that
> virtio_net failed to retry refilling its RX buffer ring.  That is
> definitely fixed.  So I'm treating this as a new bug report, #592187.

Okay, thanks. 

> > > I think you need to give your guests more memory.
> > 
> > They all have between 512M and 2G - and it happens to all of them using
> > virtio_net, and none of them using rtl8139 as a network driver,
> > reproducibly.
> 
> The RTL8139 hardware uses a single fixed RX DMA buffer.  The virtio
> 'hardware' allows the host to write into RX buffers anywhere in guest
> memory.  This results in very different allocation patterns.
> 
> Please try specifying 'e1000' hardware, i.e. an Intel gigabit
> controller.  I think the e1000 driver will have a similar allocation
> pattern to virtio_net, so you can see whether it also triggers
> allocation failures and a network stall in the guest.
> 
> Also, please test Linux 2.6.35 in the guest.  This is packaged in the
> 'experimental' suite.

I'll rig up a test machine (the crashes all occurred on production
guests, unfortunately) and report back.

> [...]
> > If it were an OOM situation, wouldn't the OOM killer be supposed to
> > kick in?
> [...]
> 
> The log you sent shows failure to allocate memory in an 'atomic' context
> where there is no opportunity to wait for pages to be swapped out.  The
> OOM killer isn't triggered until the system is running out of memory
> despite swapping out pages.

Ah, good to know, thanks!

> Also, I note that following the failure of virtio_net to refill its RX
> buffer ring, I see failures to allocate buffers for sending TCP ACKs.
> So the guest drops the ACKs, and that TCP connection will stall
> temporarily (until the peer re-sends the unacknowledged packets).
> 
> I also see 'nfs: server fileserver.backup.TechFak.Uni-Bielefeld.DE not
> responding, still trying'.  This suggests that the allocation failure in
> virtio_net has resulted in dropping packets from the NFS server.  And it
> just makes matters worse as it becomes impossible to free memory by
> flushing out buffers over NFS!

This sounds quite bad. 

This problem *seems* to be fixed by 2.6.32-19: we upgraded host and
guests to that on a different machine, and an rsync of ~1TiB of data
didn't produce any page allocation failures using virtio. But I'd wait
for the results of my rsync/nfs tests with 2.6.32-18+e1000,
2.6.32-18+virtio, 2.6.32-19+virtio and 2.6.35+virtio before concluding
that.
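
To compare those test runs, something like the following could count the
allocation failures in each guest's kernel log. This is only a rough
sketch: the message shape in the regular expression ("<comm>: page
allocation failure. order:N, mode:0x...") is my assumption about how
these kernels print it, and the function name count_failures is made up
for illustration.

```python
import re

# Assumed message shape, e.g.:
#   "rsync: page allocation failure. order:0, mode:0x20"
# Adjust the pattern if the kernel under test prints it differently.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def count_failures(lines):
    """Count allocation-failure messages per (process, order, mode)."""
    counts = {}
    for line in lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue  # not an allocation-failure line
        key = (match.group("comm"), int(match.group("order")), match.group("mode"))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Feeding it the saved dmesg output of each guest (for example
count_failures(open("dmesg.txt")), where dmesg.txt is a placeholder file
name) would give one table per kernel/driver combination.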

Thanks for taking the time to explain things!

-- 
Lukas
