
Re: Bullseye (mostly) not booting on Proliant DL380 G7



Sorry for replying to my own messages all the time ;-)
I was just able to catch a "freeze" followed by a successful boot.

The successful boot continues with these lines:

[   62.922169] systemd[1]: Finished Create System Users.
[   62.923633] systemd[1]: Starting Create Static Device Nodes in /dev...
[   62.941753] systemd[1]: Finished Create Static Device Nodes in /dev.
[   62.944691] systemd[1]: Starting Rule-based Manager for Device Events and Files...
[   62.953082] systemd[1]: modprobe@drm.service: Succeeded.
[   62.953539] systemd[1]: Finished Load Kernel Module drm.
[   62.983630] systemd[1]: Started Rule-based Manager for Device Events and Files.
[   62.991307] systemd[1]: Finished Set the console keyboard layout.
[   63.015898] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input5
[   63.016490] systemd[1]: Finished Coldplug All udev Devices.
[   63.018250] systemd[1]: Starting Helper to synchronize boot up for ifupdown...
[   63.020119] power_meter ACPI000D:00: Found ACPI power meter.
[   63.020214] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[   63.020280] power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[   63.029971] systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[   63.030392] systemd[1]: Reached target Local File Systems (Pre).
[   63.031784] IPMI message handler: version 39.2
[   63.035060] ipmi device interface
[   63.036149] ACPI: Power Button [PWRF]
[   63.038539] EDAC MC1: Giving out device to module i7core_edac.c controller i7 core #1: DEV 0000:3e:03.0 (INTERRUPT)
[   63.038670] EDAC PCI0: Giving out device to module i7core_edac controller EDAC PCI controller: DEV 0000:3e:03.0 (POLLED)
[   63.039204] EDAC MC0: Giving out device to module i7core_edac.c controller i7 core #0: DEV 0000:3f:03.0 (INTERRUPT)
[   63.039315] EDAC PCI1: Giving out device to module i7core_edac controller EDAC PCI controller: DEV 0000:3f:03.0 (POLLED)
[   63.039405] EDAC i7core: Driver loaded, 2 memory controller(s) found.
[   63.044910] ipmi_si: IPMI System Interface driver
[   63.044996] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[   63.045059] ipmi_platform: ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[   63.045134] ipmi_si: Adding SMBIOS-specified kcs state machine
[   63.045263] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[   63.045393] ipmi_si IPI0001:00: ipmi_platform: [io  0x0ca2-0x0ca3] regsize 1 spacing 1 irq 0
[   63.045652] iTCO_vendor_support: vendor-support=0
[   63.046504] hpwdt 0000:02:00.0: HPE Watchdog Timer Driver: NMI decoding initialized

This line catches my attention:

[   62.953082] systemd[1]: modprobe@drm.service: Succeeded.

This line is missing when the freeze happens.
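Comparing two boot logs by eye is error-prone. The following is a minimal Python sketch, assuming both logs are saved as plain text with console-style lines that begin with a kernel timestamp such as [   62.953082]. It strips the timestamps and lists the systemd[1] messages that appear in one log but not in the other. The function names are illustrative only.

```python
import re

# Matches a leading kernel timestamp, e.g. "[   62.953082] " or "[ 62.920767] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def systemd_messages(log_text):
    """Return the set of systemd[1] messages in a log, with timestamps removed."""
    messages = set()
    for line in log_text.splitlines():
        body = TIMESTAMP.sub("", line.strip())
        # Keep only systemd messages; kernel lines and "[ OK ]" status lines are ignored.
        if body.startswith("systemd[1]:"):
            messages.add(body)
    return messages


def missing_from(reference_log, other_log):
    """Sorted list of systemd messages present in reference_log but absent from other_log."""
    return sorted(systemd_messages(reference_log) - systemd_messages(other_log))
```

Because timestamps are stripped and a set is used, message order and timing differences are ignored; only whether a given message text appears at all is compared.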

FYI, in the meantime I also installed firmware-amd-graphics; however, the behaviour (sometimes freeze, sometimes boot) is still the same.

I'll continue troubleshooting, but if anyone has experienced something similar, has some hints, or can point to existing bugs, please let me know.

On Tue, Jun 29, 2021 at 10:04 AM Claudio Kuenzler <ck@claudiokuenzler.com> wrote:
Meanwhile, I was able to find out more by removing "quiet" from the kernel command line in GRUB.
The pcc_cpufreq_init messages do not seem to affect booting; they are just warnings.
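On Debian, "quiet" is normally set in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, and grub.cfg is regenerated with update-grub afterwards. A minimal Python sketch of that edit, assuming the value is a double-quoted, space-separated list of parameters; it works on the file's text rather than the file itself, and the function name is hypothetical.

```python
import re

# Matches: GRUB_CMDLINE_LINUX_DEFAULT="<parameters>"
CMDLINE_DEFAULT = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"(.*)"\s*$')


def remove_kernel_param(grub_default, param="quiet"):
    """Return /etc/default/grub text with `param` dropped from GRUB_CMDLINE_LINUX_DEFAULT.

    All other lines, including GRUB_CMDLINE_LINUX, are left unchanged.
    """
    out = []
    for line in grub_default.splitlines():
        m = CMDLINE_DEFAULT.match(line)
        if m:
            # Drop every occurrence of the parameter, keep the remaining ones in order.
            args = [a for a in m.group(2).split() if a != param]
            line = f'{m.group(1)}"{" ".join(args)}"'
        out.append(line)
    return "\n".join(out) + "\n"
```

For a single test boot, the same effect can be had without touching the file by pressing "e" in the GRUB menu and deleting "quiet" from the linux line.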

The following messages appear on the console before the server freezes:

[ OK ] Finished Load Kernel Module fuse.
[ 62.887855] systemd[1]: Mounting FUSE Control File System...
   Mounting FUSE Control File System...
[ 62.891852] systemd[1]: Finished Apply Kernel Variables.
[ OK ] Finished Apply Kernel Variables.
[ 62.892237] systemd[1]: Mounted FUSE Control File System.
[ OK ] Mounted FUSE Control File System.
[ 62.900668] systemd[1]: Finished Create System Users.
[ OK ] Finished Create System Users.
[ 62.902224] systemd[1]: Starting Create Static Device Nodes in /dev...
  Starting Create Static Device Nodes in /dev...
[ 62.920767] systemd[1]: modprobe@drm.service: Succeeded.
[ 62.921202] systemd[1]: Finished Load Kernel Module drm.
[ OK ] Finished Load Kernel Module drm.
[ 62.921979] systemd[1]: Finished Create Static Device Nodes in /dev.
[ OK ] Finished Create Static Device Nodes in /dev.
[ 62.925007] systemd[1]: Starting Rule-based Manager for Device Events and Files...
   Starting Rule-based Manager for Device Events and Files...
[ 62.955322] systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[ OK ] Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[ 62.962186] systemd[1]: Started Rule-based Manager for Device Events and Files.

After this there are no further messages and no login prompt, and the server no longer reacts to keyboard input. Only a hardware reset helps in this case.
Out of ~10 server reboots, this problem occurred 4 or 5 times.
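Because the freeze is intermittent (roughly 4 or 5 out of ~10 boots), a few successful boots after a change do not prove much on their own. Assuming boots are independent and the per-boot failure rate p stays constant, the chance that n boots all succeed while the fault is still present is (1 - p)^n. The sketch below turns that into a minimum number of consecutive clean boots to wait for; the 5% threshold is an arbitrary assumption.

```python
import math


def chance_all_boots_ok(failure_rate, boots):
    """Probability that `boots` independent boots all succeed although the
    fault is still present with the given per-boot failure rate."""
    return (1.0 - failure_rate) ** boots


def boots_needed(failure_rate, alpha=0.05):
    """Smallest number of consecutive clean boots after which an unchanged
    fault would have shown up at least once with probability >= 1 - alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - failure_rate))
```

With p = 0.4 this gives 6 boots (0.6^6 is about 0.047), and with p = 0.5 it gives 5. Since p itself is estimated from only ~10 boots, these numbers are rough.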

Could it have something to do with DRM? I've seen a DRM driver error during an earlier boot phase.

Jun 28 16:15:05 irczsrvp08 kernel: [   63.182074] [drm] radeon kernel modesetting enabled.
Jun 28 16:15:05 irczsrvp08 kernel: [   63.182197] radeon 0000:01:03.0: vgaarb: deactivate vga console
Jun 28 16:15:05 irczsrvp08 kernel: [   63.183720] Console: switching to colour dummy device 80x25
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184088] [drm] initializing kernel modesetting (RV100 0x1002:0x515E 0x103C:0x31FB 0x02).
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184208] radeon 0000:01:03.0: VRAM: 128M 0x00000000E8000000 - 0x00000000EFFFFFFF (64M used)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184210] radeon 0000:01:03.0: GTT: 512M 0x00000000C8000000 - 0x00000000E7FFFFFF
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184219] [drm] Detected VRAM RAM=128M, BAR=128M
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184220] [drm] RAM width 16bits DDR
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184302] [TTM] Zone  kernel: Available graphics memory: 49487844 KiB
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184304] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184305] [TTM] Initializing pool allocator
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184310] [TTM] Initializing DMA pool allocator
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184333] [drm] radeon: 64M of VRAM memory ready
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184334] [drm] radeon: 512M of GTT memory ready.
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184371] [drm] GART: num cpu pages 131072, num gpu pages 131072
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205645] [drm] PCI GART of 512M enabled (table at 0x00000000FFF00000).
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205890] radeon 0000:01:03.0: WB disabled
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205894] radeon 0000:01:03.0: fence driver on ring 0 use gpu addr 0x00000000c8000000
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205967] [drm] radeon: irq initialized.
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205980] [drm] Loading R100 Microcode
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206233] radeon 0000:01:03.0: firmware: failed to load radeon/R100_cp.bin (-2)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206241] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206246] radeon 0000:01:03.0: Direct firmware load for radeon/R100_cp.bin failed with error -2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206311] [drm:r100_cp_init [radeon]] *ERROR* Failed to load firmware!
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206318] radeon 0000:01:03.0: failed initializing CP (-2).
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206321] radeon 0000:01:03.0: Disabling GPU acceleration
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206329] [drm] radeon: cp finalized
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206961] [drm] No TV DAC info found in BIOS
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206996] [drm] Radeon Display Connectors
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206997] [drm] Connector 0:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206998] [drm]   VGA-1
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206999] [drm]   DDC: 0x60 0x60 0x60 0x60 0x60 0x60 0x60 0x60
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207000] [drm]   Encoders:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207001] [drm]     CRT1: INTERNAL_DAC1
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207002] [drm] Connector 1:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207003] [drm]   VGA-2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207004] [drm]   DDC: 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207004] [drm]   Encoders:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207005] [drm]     CRT2: INTERNAL_DAC2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.236242] kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Jun 28 16:15:05 irczsrvp08 kernel: [   63.245005] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250269] [drm] fb mappable at 0xE8040000
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250270] [drm] vram apper at 0xE8000000
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250271] [drm] size 1572864
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250271] [drm] fb depth is 16
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250272] [drm]    pitch is 2048
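The -2 in the radeon lines is a negated errno value, ENOENT ("No such file or directory"), meaning the firmware file was not found when the driver requested it. A small Python sketch, assuming the kernel log is available as text, pulls every "firmware: failed to load" line out of the log, decodes the error code and checks whether the file now exists under a firmware directory (/lib/firmware by default, an assumption about the layout). The function name is hypothetical.

```python
import errno
import os
import re

# Matches e.g. "firmware: failed to load radeon/R100_cp.bin (-2)"
FW_FAIL = re.compile(r"firmware: failed to load (\S+) \((-\d+)\)")


def failed_firmware(log_text, firmware_root="/lib/firmware"):
    """Return a list of (firmware file, errno name, exists under firmware_root)
    for each 'failed to load' line in a kernel log."""
    results = []
    for line in log_text.splitlines():
        m = FW_FAIL.search(line)
        if m:
            path, code = m.group(1), -int(m.group(2))
            # errno.errorcode maps numbers to symbolic names, e.g. 2 -> "ENOENT".
            name = errno.errorcode.get(code, str(code))
            exists = os.path.exists(os.path.join(firmware_root, path))
            results.append((path, name, exists))
    return results
```

This only checks the installed root filesystem; if the driver is loaded from the initramfs, the firmware has to be present there as well.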

Could this be related to the known Bullseye erratum https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989863 ?



On Mon, Jun 28, 2021 at 8:32 PM Claudio Kuenzler <ck@claudiokuenzler.com> wrote:
Hello!

I am currently testing the new Bullseye release (using firmware-bullseye-DI-rc2-amd64-netinst.iso) and am seeing a strange phenomenon on an HP Proliant DL380 G7 server.

During boot, the following messages show up in the console:

[63.063844] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
[63.063895] pcc_cpufreq_init: Try to enable another scaling driver through BIOS settings
[63.063943] pcc_cpufreq_init: and complain to the system vendor

According to Andreas Herrmann, the settings can be defined in the HP server BIOS:

Power Management -> Advanced Power Options -> Collaborative Power Control = enabled

This is enabled (I believe it is the default). The Power Regulator is set to "Dynamic Power Savings Mode".
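The pcc_cpufreq messages say that this driver gave up; whether another cpufreq driver took over (for example after changing these BIOS options) can be checked in sysfs. A minimal sketch, assuming the standard cpufreq sysfs layout /sys/devices/system/cpu/cpuN/cpufreq/scaling_driver, reads the active driver name for one CPU and returns None when the file is missing, which usually means no cpufreq driver is bound to that CPU. The function name is hypothetical.

```python
from pathlib import Path


def scaling_driver(sysfs_cpu="/sys/devices/system/cpu", cpu=0):
    """Return the cpufreq scaling driver name for one CPU, or None if the
    scaling_driver file does not exist (typically: no cpufreq driver active)."""
    path = Path(sysfs_cpu) / f"cpu{cpu}" / "cpufreq" / "scaling_driver"
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None
```

On a running system, scaling_driver() would return a name such as "acpi-cpufreq" if that driver is in use.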

After these messages show up on the console, no login prompt appears and the network is not started. The server seems frozen: it doesn't even react to CTRL+ALT+DEL on the console anymore. I'm not sure whether this is caused by cpufreq or by something else, though.

This boot problem happened on 2 out of 3 server boots.

Is this a bug in Bullseye?

thx for any hints.

