Bug#1036530: linux-signed-amd64: Hard lock up of system



Control: tags -1 - moreinfo

Hi,

I repeated the git bisect, and the bad commit seems to be:

(git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
commit 24867516f06dabedef3be7eea0ef0846b91538bc
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:   Tue Aug 23 13:51:31 2022 -0500

    ACPI: OSI: Remove Linux-Dell-Video _OSI string
    
    This string was introduced because drivers for NVIDIA hardware
    had bugs supporting RTD3 in the past.
    
    Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
    had a mechanism for switching PRIME on and off, though it had required
    to logout/login to make the library switch happen.
    
    When the PRIME had been off, the mechanism had unloaded the NVIDIA
    driver and put the device into D3cold, but the GPU had never come back
    to D0 again which is why ODMs used the _OSI to expose an old _DSM
    method to switch the power on/off.
    
    That has been fixed by commit 5775b843a619 ("PCI: Restore config space
    on runtime resume despite being unbound"). so vendors shouldn't be
    using this string to modify ASL any more.
    
    Reviewed-by: Lyude Paul <lyude@redhat.com>
    Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

 drivers/acpi/osi.c | 9 ---------
 1 file changed, 9 deletions(-)

This machine is a Dell with an NVIDIA chip, so it looks like this really
could be the commit that is causing the problems. The description
of the commit also seems (to my untrained eye) to be consistent with the
error reported on the console when the lockup occurs:

[   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[   60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
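
As a rough sketch only: when deciding whether a boot counts as "bad"
during a bisect, saved kernel log text could be checked for the messages
above with something like the following Python. The function name
find_lockup_signs and both regular expressions are illustrative
assumptions derived from the three lines quoted above, not an existing
tool, and other failure signatures may exist.

```python
import re

# Patterns derived from the console messages quoted in this report.
# They are assumptions for illustration and may need adjusting.
ACPI_TIMEOUT = re.compile(
    r"ACPI Error: Aborting method (\S+) due to previous error "
    r"\(AE_AML_LOOP_TIMEOUT\)"
)
POWER_FAIL = re.compile(
    r"(\S+) (\S+) Unable to change power state from D3cold to D0"
)


def find_lockup_signs(log_text):
    """Return (acpi_methods, devices) found in the given kernel log text.

    acpi_methods: ACPI method paths aborted with AE_AML_LOOP_TIMEOUT.
    devices: PCI addresses that failed to go from D3cold to D0.
    """
    methods = ACPI_TIMEOUT.findall(log_text)
    devices = [dev for _driver, dev in POWER_FAIL.findall(log_text)]
    return methods, devices
```

Two empty lists only mean that neither message was logged yet; since the
lockup can take hours to appear, that alone is not proof of a good
kernel.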

Hopefully this is enough information for experts to resolve this.

Regards,

Nick.

* Salvatore Bonaccorso <carnil@debian.org> [230526 20:30]:
> Control: tags -1 + moreinfo
> 
> Hi Nick,
> 
> On Fri, May 26, 2023 at 09:25:23AM +0900, Nick Hastings wrote:
> > Hi Salvatore,
> > 
> > thanks for your help. However, I'm now not sure if I really have
> > identified the commit that causes my problems. I fear I may have made
> > one or more mistakes when setting "git bisect good". I had been under
> > the impression that the lock up would happen no more than a few tens of
> > minutes after booting, however it seems that sometimes it can take a few
> > hours to occur.
> > 
> > So, I'm running the git bisect again and will be more careful before
> > marking "git bisect good". It could take a few days.
> > 
> > Should this particular bug be closed?
> 
> Thanks a lot for reporting back, the time you put into the bisect is very
> appreciated and valued! No, no need to close this one, as the bug
> still persists. Just follow up please once you have identified the
> culprit with the fresh bisect.
> 
> Please also remove the moreinfo tag again by then (you can write
> a control message with tag -1 - moreinfo, so it won't appear as a bug
> needing information from the reporter).
> 
> Thank you!
> 
> Regards,
> Salvatore

