Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes



On Tue, 8 Aug 2017 11:38:06 +0200 (CEST) Sven Hartge <sven@svenhartge.de> wrote:
> At 16:22 on 03.08.17, Sven Hartge wrote:
> > On 03.08.2017 15:34, Patrick Matthäi wrote:
> >> On 16.07.2017 at 23:42, Ben Hutchings wrote:
> >>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
>  
> >>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
> >>> Note that this has been root-caused as a bug in the virtual device, not
> >>> the driver.  (Though it would be nice if the driver could work around
> >>> it.)
> >
> >> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
> >> 5969303 from 27.07.2017, that is why I lowered the severity.
> >
> > This is the version from 6.5u1, right?
> >
> > Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.
>
> Hmm. There are discussions on Reddit right now indicating the bug still
> occurs even with the latest ESXi6.5u1 (Build 5969303).
>
> https://www.reddit.com/r/homelab/comments/6s5dh6/debian_9_on_esxi_65u1_complete_lockup/
>
> One of the latest comments on the Kernel Bugzilla shows the same:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=191201#c54
>
> (For me, this is really frustrating right now, since I waited until
> ESX6.5u1 before updating my infrastructure and now it seems I have to push
> this update even farther into the future because of this critical blocker
> bug.)
>
> I really wonder what could be done on the kernel side to avoid the
> problem, since only newer kernels are affected while older ones don't
> show the problem.
>
> Regards,
> Sven.
>
>
Hi Sven,

Both of those reports were from me. I suspect the issue may be isolated to the HPE custom build of ESXi 6.5u1. I haven't seen any similar reports from people using the vanilla 6.5u1 build.

Interestingly, none of the workarounds that have been discussed help with this build either. This includes disabling the rx-mini buffer (# ethtool -G <interface> rx-mini 0) and adding vmxnet3.rev.30 = FALSE to the VM's vmx file.
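For anyone who wants to retry the rx-mini workaround on their own build, here is a minimal sketch of it as a shell helper. The function names and the DRY_RUN variable are hypothetical, chosen for illustration; only the ethtool -G invocation comes from the discussion. By default it only prints the command, so nothing changes until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: rx_mini_cmd, disable_rx_mini and DRY_RUN are illustrative
# names, not something from the bug report.

# Print the ethtool command that sets the rx-mini ring size to 0.
rx_mini_cmd() {
    printf 'ethtool -G %s rx-mini 0\n' "$1"
}

# Apply the workaround to one interface.
# With DRY_RUN=1 (the default) the command is printed, not executed.
disable_rx_mini() {
    iface=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        rx_mini_cmd "$iface"
    else
        ethtool -g "$iface"            # show ring sizes before the change
        ethtool -G "$iface" rx-mini 0  # the workaround quoted above
        ethtool -g "$iface"            # show ring sizes after the change
    fi
}
```

Run as root with something like `DRY_RUN=0 disable_rx_mini eth0`, substituting the real interface name. The setting does not persist across reboots unless it is also added to the network configuration.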

The only way I've managed to restore stability is by taking vmxnet3 out of the equation completely and switching the VM to the e1000 NIC type.
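After changing the NIC type, it is worth confirming from inside the guest which driver the interface is actually bound to. A small sketch, assuming the interface is called eth0 (an assumption, adjust as needed); driver_name is a hypothetical helper that just picks the "driver:" line out of `ethtool -i` output.

```shell
#!/bin/sh
# Sketch only: driver_name is an illustrative helper name.

# Read `ethtool -i <interface>` output on stdin and print the driver name.
driver_name() {
    awk -F': ' '$1 == "driver" { print $2 }'
}

# Usage (interface name is an assumption):
#   ethtool -i eth0 | driver_name
```

If this still prints vmxnet3, the VM is still using the vmxnet3 adapter and the change has not taken effect.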

Thanks,
Andrew
 
