Bug#1104670: [Intel-wired-lan] Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
- To: "Hutchings, Ben" <ben@decadent.org.uk>, "intel-wired-lan@lists.osuosl.org" <intel-wired-lan@lists.osuosl.org>, linux-pci <linux-pci@vger.kernel.org>, Pavan Chebbi <pavan.chebbi@broadcom.com>, Michael Chan <mchan@broadcom.com>
- Cc: Laurent Bonnaud <L.Bonnaud@laposte.net>, "1104670@bugs.debian.org" <1104670@bugs.debian.org>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>
- Subject: Bug#1104670: [Intel-wired-lan] Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
- From: "Loktionov, Aleksandr" <aleksandr.loktionov@intel.com>
- Date: Mon, 14 Jul 2025 09:21:25 +0000
- Message-id: <IA3PR11MB898660E9CAF3728B3544C6C5E554A@IA3PR11MB8986.namprd11.prod.outlook.com>
- Reply-to: "Loktionov, Aleksandr" <aleksandr.loktionov@intel.com>, 1104670@bugs.debian.org
- In-reply-to: <c40b5e6cb26654f698e51b131956065b952ad222.camel@decadent.org.uk>
- References: <89159d74-c343-480f-9509-b6457244d65d@laposte.net> <8a232a97-5917-41d3-8e88-e68abdc83202@laposte.net> <c40b5e6cb26654f698e51b131956065b952ad222.camel@decadent.org.uk> <89159d74-c343-480f-9509-b6457244d65d@laposte.net>
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Ben Hutchings
> Sent: Saturday, July 12, 2025 5:13 PM
> To: intel-wired-lan@lists.osuosl.org; linux-pci <linux-pci@vger.kernel.org>; Pavan Chebbi <pavan.chebbi@broadcom.com>; Michael Chan <mchan@broadcom.com>
> Cc: Laurent Bonnaud <L.Bonnaud@laposte.net>; 1104670@bugs.debian.org; netdev@vger.kernel.org
> Subject: Re: [Intel-wired-lan] Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
>
> Hi all,
>
> On Sun, 2025-05-04 at 13:45 +0200, Laurent Bonnaud wrote:
> [...]
> > - Previously the kernel would output an error in /var/lib/systemd/pstore/ but would shutdown anyway.
> >
> > - Now, with kernel 6.1.135-1, the shutdown is blocked as with 6.12.x kernels (see below).
> > --
> > Laurent.
> >
> > <30>[ 961.098671] systemd-shutdown[1]: Rebooting.
> > <6>[ 961.098743] kvm: exiting hardware virtualization
> > <6>[ 961.361878] megaraid_sas 0000:17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> > <6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
> > <0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> > <0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
> > <0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
> > <0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
> > <0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
> > <0>[ 963.828217] {1}[Hardware Error]: version: 3.0
> > <0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> > <0>[ 963.828220] {1}[Hardware Error]: device_id: 0000:01:00.1
> > <0>[ 963.828221] {1}[Hardware Error]: slot: 6
> > <0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
> > <0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
> > <0>[ 963.828224] {1}[Hardware Error]: class_code: 020000
> > <0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00018000
> > <0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
> > <0>[ 963.828227] {1}[Hardware Error]: TLP Header: 40000001 0000000f 90028090 00000000
> [...]
>
> It seems that this is a known bug in the BIOS of several Dell
> PowerEdge models including (in this case) the R540.
>
> A workaround was added to the tg3 driver
> <https://git.kernel.org/linus/e0efe83ed325277bb70f9435d4d9fc70bebdcca8>
> and a similar change was proposed (but not accepted) in the i40e driver
> <https://lore.kernel.org/all/20241227035459.90602-1-yue.zhao@shopee.com/>.
> On this system the error log points to a device handled by the ixgbe
> driver, and no workaround has been implemented for that.
>
> Since this issue seems to affect multiple different NIC vendors and
> drivers, would it make more sense to implement this workaround as a
> PCI quirk?
>
I support the idea of a PCI workaround, but who will implement it?
Alex
> Ben.
>
> --
> Ben Hutchings
> Experience is directly proportional to the value of equipment destroyed
>                                                     - Carolyn Scheppner
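
For readers decoding the quoted GHES record, the following is a minimal sketch, not part of the original thread, that maps the aer_uncor_status, aer_uncor_mask and aer_uncor_severity values from the log to bit names. The bit table follows the commonly documented layout of the PCIe AER Uncorrectable Error Status register and lists only a subset of the defined bits. The names AER_UNCOR_BITS and decode_aer_uncor are hypothetical helpers introduced here for illustration.

```python
# Subset of the PCIe AER Uncorrectable Error Status bit layout
# (assumed here from the PCIe AER register definition; not taken from the thread).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return the names of all bits set in a 32-bit AER uncorrectable register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits missing from the table are reported by position only.
            names.append(AER_UNCOR_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


if __name__ == "__main__":
    # Values copied from the quoted log.
    for label, value in [
        ("aer_uncor_status", 0x00100000),
        ("aer_uncor_mask", 0x00018000),
        ("aer_uncor_severity", 0x000EF010),
    ]:
        print(f"{label} = {value:#010x}: {', '.join(decode_aer_uncor(value))}")
```

Under this table, the status value 0x00100000 has only bit 20 set (Unsupported Request), and that bit is not set in the logged mask or severity values.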