Bug#1098261: linux: kernel panic on boot with certain large NVME configurations
On Tue, Feb 18, 2025 at 03:11:08PM +0100, Salvatore Bonaccorso wrote:
> > Microsoft has observed that the 5.10.y kernels in bullseye are susceptible
> > to crashes due to race conditions in the NVME/PCI subsystem. See below for
> > a representative kernel log. The problem appears most frequently in larger
> > systems, e.g. with 4 or more NVME devices and >= 64 CPUs, but it could
> > potentially occur on smaller systems as well.
> >
> > The issue was fixed with the 5.14 kernel upstream in e4b9852a0 ("nvme-pci:
> > fix multiple races in nvme_setup_io_queues"), so this only impacts
> > oldstable. I have provided a backport of this commit upstream in
> > https://lore.kernel.org/stable/E1tj8vO-00471h-2H@lore/
> >
> > I'm requesting that this commit be included in a bullseye kernel update.
>
> AFAICS, this backport has not been accepted back then for 5.10.y. Can
> you re-ping upstream to make sure it gets included in the 5.10.y
> series? Once this has happened as we follow the 5.10.y series it will
> be included (or can be included in advance once it has been queued).
Yes, I forgot to reset the date on the commit that I sent upstream,
which is why it looks like it's been around since 2021. I requested
that upstream apply the fix to 5.10.y last week, and will ping them in
another week or two if it hasn't been acknowledged either way...
noah