
Bug#1035938: KVM guests did not boot properly after upgrade from 11.6 to 11.7



Package: linux-image-amd64
Version: 5.10.178-3

We have a small fleet of Debian servers, some of which are KVM guests.
After an upgrade from 11.6 to 11.7, some of them did not boot up properly
or at all. We narrowed the problem down to (probably) the 5.10.0-22-amd64
kernel, because the servers boot properly with the 5.10.0-21-amd64 kernel.
All misbehaving machines are KVM guests.

Additionally, the problem seems to be related to only one of our
KVM hypervisors (one server), which runs standard Debian Bullseye
(not yet upgraded to 11.7), but with the 6.1.0-0.deb11.5-amd64 kernel
from backports.

A short simple description of the situation:

* What worked: fully updated Debian 11.6 (5.10.0-21-amd64)
* What is not working: fully updated Debian 11.7 (5.10.0-22-amd64)
* What is working: fully updated Debian 11.7 with the previous kernel
  (5.10.0-21-amd64)

Expected behaviour: a working fully updated Debian 11.7
                    with 5.10.0-22-amd64 kernel

In the logs we found a lot of segfaults related to libc/ld. I'm pasting
them below. They are from only one machine; other guests behave
similarly.
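To check which libraries the segfaults cluster in, the kernel messages can be tallied per mapped object. The following is only a minimal sketch: the regular expression and the function name `count_by_object` are illustrative assumptions, and the two sample lines are copied from the logs in this report.

```python
import re
from collections import Counter

# Illustrative pattern for the kernel's "segfault at ..." message.
# The trailing "in <object>[" part is optional because some pasted
# lines are truncated.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[)?"
)


def count_by_object(log_text):
    """Count segfault messages per mapped object (e.g. libc-2.31.so)."""
    counts = Counter()
    for match in SEGFAULT_RE.finditer(log_text):
        counts[match.group("obj") or "unknown"] += 1
    return counts


# Two lines copied from the logs in this report.
sample = (
    "[ 2.097033] modprobe[250]: segfault at 7fffcb38df60 ip 00007faa724aabda "
    "sp 00007fffcb38df50 error 27 in ld-2.31.so[7faa724a1000+20000]\n"
    "[ 2.186245] systemctl[302]: segfault at 7f60950bda66 ip 00007f60951e4bee "
    "sp 00007ffd6fc41020 error 25 in libc-2.31.so[7f60950d2000+159000]\n"
)

if __name__ == "__main__":
    for obj, n in count_by_object(sample).most_common():
        print(obj, n)
```

Feeding the full output of `journalctl -k` or `dmesg` to `count_by_object` would give the per-library totals for one boot.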


Hypervisor details:
* Linux tor 6.1.0-0.deb11.5-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.12-1~bpo11+1 (2023-03-05) x86_64 GNU/Linux
* AMD EPYC 75F3 32-Core Processor
* ii libc6:amd64 2.31-13+deb11u5 amd64 GNU C Library: Shared libraries


Guest libc6 details (KVM guest):
# dpkg -l | grep libc6
ii libc6:amd64 2.31-13+deb11u6 amd64 GNU C Library: Shared libraries
ii libc6-dev:amd64 2.31-13+deb11u6 amd64 GNU C Library: Development Libraries and Header Files


Reviewing all the logs did not show any reason for such behaviour;
segfaults start to appear at different moments without apparent
reason. Switching back to the previous kernel (5.10.0-21-amd64)
resolves the issue. Hypervisor logs also did not show anything that
could suggest what the problem is.


May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.672812] (mount)[209]: segfault at 7f7083027068 ip 00007f7082e8c250 sp 00007ffc276302d8 error 25 in libsystemd-shared-247.so[7f7082dec000+18d000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.672822] Code: Unable to access opcode bytes at RIP 0x7f7082e8c226.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.720393] fuse: init (API version 7.32)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.720569] xfs filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.746690] loadkeys[239]: segfault at 7fff01beebf8 ip 00007f8eb43b23d6 sp 00007fff01beebf8 error 25 in libc-2.31.so[7f8eb42e9000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.746695] Code: Unable to access opcode bytes at RIP 0x7f8eb43b23ac.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778082] systemd-udevd[279]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp 00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778087] Code: Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778352] systemd-udevd[283]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp 00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778355] Code: Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778825] systemd-udevd[288]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp 00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778828] Code: Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782195] systemd-udevd[290]: segfault at 7ffee57897f8 ip 00007f1cfcbf7d1e sp 00007ffee57897f8 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782198] Code: Unable to access opcode bytes at RIP 0x7f1cfcbf7cf4.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782729] systemd-udevd[271]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp 00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782732] Code: Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.784499] systemd-udevd[278]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp 00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.784504] Code: Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.785340] systemd-udevd[285]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp 00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.785344] Code: Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.802642] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.837995] ACPI: Power Button [PWRF]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.856081] input: PC Speaker as /devices/platform/pcspkr/input/input5
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.858827] sr 0:0:0:0: Attached scsi generic sg0 type 5
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.873806] pstore: Using crash dump compression: deflate
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.873810] pstore: Registered efi as persistent store backend
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.874814] iTCO_vendor_support: vendor-support=0
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.876912] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.876965] iTCO_wdt: Found a ICH9 TCO device (Version=2, TCOBASE=0x0660)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.878544] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.880777] cryptd: max_cpu_qlen set to 1000
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.898590] AVX2 version of gcm_enc/dec engaged.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.898591] AES CTR mode by8 optimization enabled
May 3 23:19:46 gitlab-runner-portalprod kernel: [ 2.113861] pcieport 0000:00:01.6: pciehp: Slot(0-6): No device found
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.018119] show_signal_msg: 14 callbacks suppressed
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.018123] systemd-udevd[274]: segfault at 7ffee57897f8 ip 00007f1cfcbf7d1e sp 00007ffee57897f8 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.018134] Code: Unable to access opcode bytes at RIP 0x7f1cfcbf7cf4.
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.019739] systemd-udevd[296]: segfault at 7ffee5789848 ip 00007f1cfcbf7d1e sp 00007ffee5789848 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.019764] Code: Unable to access opcode bytes at RIP 0x7f1cfcbf7cf4.
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.854995] systemctl[527]: segfault at 7ffd14e78928 ip 00007f9edaedaa13 sp 00007ffd14e788f0 error 25 in libc-2.31.so[7f9edae0d000+159000]
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.855007] Code: Unable to access opcode bytes at RIP 0x7f9edaeda9e9.
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.855253] dbus-daemon[433]: segfault at 7ffc1aa74a28 ip 00007f1fec03ed1e sp 00007ffc1aa74a28 error 25 in libc-2.31.so[7f1febf65000+159000]
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.855261] Code: Unable to access opcode bytes at RIP 0x7f1fec03ecf4.
May 3 23:20:29 gitlab-runner-portalprod kernel: [ 44.834159] pager[544]: segfault at 7ffc88996ea8 ip 00007f65c20003d6 sp 00007ffc88996ea8 error 25 in libc-2.31.so[7f65c1f37000+159000]
May 3 23:20:29 gitlab-runner-portalprod kernel: [ 44.834170] Code: Unable to access opcode bytes at RIP 0x7f65c20003ac.
May 3 23:21:12 gitlab-runner-portalprod kernel: [ 88.221613] run-parts[554]: segfault at 7ffe50526628 ip 00007fb22417370e sp 00007ffe50526628 error 25 in libc-2.31.so[7fb2240ce000+159000]
May 3 23:21:12 gitlab-runner-portalprod kernel: [ 88.221622] Code: Unable to access opcode bytes at RIP 0x7fb2241736e4.
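The "error NN" field in these messages is the x86 page-fault error code, and decoding its bits may help narrow down what kind of access failed. The sketch below assumes the kernel prints that field in hexadecimal and uses the bit names from the kernel's arch/x86/include/asm/trap_pf.h as recalled here; both should be verified against the source of the kernel in question. The function name `decode_pf_error` is illustrative.

```python
# x86 page-fault error-code bits (names as in the kernel's trap_pf.h;
# listed here as an assumption, verify against the kernel source).
PF_BITS = [
    (0x01, "PROT"),   # set: protection violation, clear: page not present
    (0x02, "WRITE"),  # set: write access, clear: read access
    (0x04, "USER"),   # set: fault raised while in user mode
    (0x08, "RSVD"),   # reserved bit set in a page-table entry
    (0x10, "INSTR"),  # fault on an instruction fetch
    (0x20, "PK"),     # protection-key violation
]


def decode_pf_error(code_text):
    """Decode the 'error NN' field of a segfault line (assumed hexadecimal)."""
    code = int(code_text, 16)
    return [name for bit, name in PF_BITS if code & bit]


if __name__ == "__main__":
    # The two values that appear in the logs of this report.
    for text in ("25", "27"):
        print(text, decode_pf_error(text))
```

Under these assumptions, both values seen in this report (25 and 27) would include the protection-key bit, differing only in read versus write access. Whether that is related to the root cause is not established here.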



May 11 13:16:03 portal-prod kernel: nft[256]: segfault at 7fc86b408c38 ip 00007fc86b4321ef sp 00007ffcc1dc1e08 error 27 in ld-2.31.so[7fc86b4130>


During this particular boot the first problem was the "nft" segfault,
then several others appeared and the machine was unusable. I saw a kernel
panic once, but cannot reproduce it now.


maj 11 13:16:03 portal-prod systemd[1]: Starting Load Kernel Module fuse...
maj 11 13:16:03 portal-prod systemd[1]: Starting nftables...
maj 11 13:16:03 portal-prod systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
maj 11 13:16:03 portal-prod systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
maj 11 13:16:03 portal-prod kernel: nft[256]: segfault at 7fc86b408c38 ip 00007fc86b4321ef sp 00007ffcc1dc1e08 error 27 in ld-2.31.so[7fc86b4130>
maj 11 13:16:03 portal-prod systemd[1]: Starting Journal Service...
maj 11 13:16:03 portal-prod kernel: Code: Unable to access opcode bytes at RIP 0x7fc86b4321c5.
maj 11 13:16:03 portal-prod kernel: fuse: init (API version 7.32)
maj 11 13:16:03 portal-prod systemd[1]: Starting Load Kernel Modules...
maj 11 13:16:03 portal-prod systemd[1]: Starting Remount Root and Kernel File Systems...
maj 11 13:16:03 portal-prod systemd[1]: Starting Coldplug All udev Devices...
maj 11 13:16:03 portal-prod systemd[1]: Mounted Huge Pages File System.
maj 11 13:16:03 portal-prod systemd[1]: Mounted POSIX Message Queue File System.
maj 11 13:16:03 portal-prod kernel: EXT4-fs (vda3): re-mounted. Opts: errors=remount-ro
maj 11 13:16:03 portal-prod systemd[1]: Mounted Kernel Debug File System.
maj 11 13:16:03 portal-prod systemd[1]: Mounted Kernel Trace File System.
maj 11 13:16:03 portal-prod systemd[1]: Finished Create list of static device nodes for the current kernel.
maj 11 13:16:03 portal-prod systemd[1]: modprobe@configfs.service: Succeeded.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Module configfs.
maj 11 13:16:03 portal-prod systemd[1]: modprobe@drm.service: Succeeded.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Module drm.
maj 11 13:16:03 portal-prod systemd[1]: modprobe@fuse.service: Succeeded.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Module fuse.
maj 11 13:16:03 portal-prod systemd[1]: nftables.service: Main process exited, code=killed, status=11/SEGV
maj 11 13:16:03 portal-prod systemd[1]: nftables.service: Failed with result 'signal'.
maj 11 13:16:03 portal-prod systemd[1]: Failed to start nftables.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Modules.
maj 11 13:16:03 portal-prod systemd[1]: Finished Remount Root and Kernel File Systems.
maj 11 13:16:03 portal-prod systemd[1]: Reached target Network (Pre).
maj 11 13:16:03 portal-prod systemd[1]: Mounting FUSE Control File System...
maj 11 13:16:03 portal-prod systemd[1]: Mounting Kernel Configuration File System...
maj 11 13:16:03 portal-prod systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
maj 11 13:16:03 portal-prod systemd[1]: Starting Apply Kernel Variables...
maj 11 13:16:03 portal-prod systemd[1]: Starting Create System Users...
maj 11 13:16:03 portal-prod systemd[1]: Mounted Kernel Configuration File System.
maj 11 13:16:03 portal-prod systemd[1]: Mounted FUSE Control File System.
maj 11 13:16:03 portal-prod systemd[1]: systemd-sysctl.service: Main process exited, code=killed, status=11/SEGV
maj 11 13:16:03 portal-prod systemd[1]: systemd-sysctl.service: Failed with result 'signal'.
maj 11 13:16:03 portal-prod systemd[1]: Failed to start Apply Kernel Variables.


Another boot got as far as the login prompt, with several segfaults and no
network, but I was thrown back to the login screen after a moment.


[    2.095923] systemd[1]: Finished Load Kernel Modules.
[    2.097033] modprobe[250]: segfault at 7fffcb38df60 ip 00007faa724aabda sp 00007fffcb38df50 error 27 in ld-2.31.so[7faa724a1000+20000]
[    2.098711] Code: Unable to access opcode bytes at RIP 0x7faa724aabb0.
[    2.099747] systemd[1]: Mounting Kernel Configuration File System...
[    2.101174] systemd[1]: Starting Apply Kernel Variables...
[    2.101995] systemd[1]: modprobe@fuse.service: Succeeded.
[    2.102743] systemd[1]: Finished Load Kernel Module fuse.
[    2.103471] systemd[1]: Condition check resulted in FUSE Control File System being skipped.
[    2.105732] systemd[1]: Mounted Kernel Configuration File System.
[    2.107672] EXT4-fs (vda3): re-mounted. Opts: errors=remount-ro
[    2.109296] systemd[1]: Finished Remount Root and Kernel File Systems.
[    2.110564] systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
[    2.112033] systemd[1]: Starting Create System Users...
[    2.117134] systemd[1]: Finished Apply Kernel Variables.
[    2.117844] systemd[1]: systemd-sysusers.service: Main process exited, code=killed, status=11/SEGV
[    2.118923] systemd[1]: systemd-sysusers.service: Failed with result 'signal'.
[    2.119857] systemd[1]: Failed to start Create System Users.
[    2.120900] systemd[1]: Starting Create Static Device Nodes in /dev...
[    2.121841] systemd[1]: modprobe@drm.service: Succeeded.
[    2.122546] systemd[1]: Finished Load Kernel Module drm.
[    2.137334] systemd[1]: Started Journal Service.
[    2.170459] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
[    2.185042] ACPI: Power Button [PWRF]
[    2.186245] systemctl[302]: segfault at 7f60950bda66 ip 00007f60951e4bee sp 00007ffd6fc41020 error 25 in libc-2.31.so[7f60950d2000+159000]
[    2.188023] Code: Unable to access opcode bytes at RIP 0x7f60951e4bc4.
[    2.202115] input: PC Speaker as /devices/platform/pcspkr/input/input5
[    2.203174] sr 0:0:0:0: Attached scsi generic sg0 type 5
[    2.205279] pstore: Using crash dump compression: deflate
[    2.205665] pstore: Registered efi as persistent store backend
[    2.210254] modprobe[322]: segfault at 7ffc6a6d5fc8 ip 00007fabda493093 sp 00007ffc6a6d5fd0 error 27 in ld-2.31.so[7fabda493000+20000]
[    2.211566] Code: Unable to access opcode bytes at RIP 0x7fabda493069.
[    2.217067] Adding 1952764k swap on /dev/vda2. Priority:-2 extents:1 across:1952764k FS
[    2.221668] systemd-sysuser[331]: segfault at 7f4a6768e090 ip 00007f4a683291ef sp 00007fff92a6dc58 error 27 in ld-2.31.so[7f4a6830a000+20000]
[    2.222233] setfont[332]: segfault at 7ffe6ed25fe8 ip 00007f3807b173d6 sp 00007ffe6ed25fe8 error 25 in libc-2.31.so[7f3807a4e000+159000]
[    2.222686] Code: Unable to access opcode bytes at RIP 0x7f4a683291c5.
[    2.223633] Code: Unable to access opcode bytes at RIP 0x7f3807b173ac.
[    2.225057] iTCO_vendor_support: vendor-support=0
[    2.225520] cryptd: max_cpu_qlen set to 1000
[    2.226497] fuse: init (API version 7.32)
[    2.228231] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[    2.228728] iTCO_wdt: Found a ICH9 TCO device (Version=2, TCOBASE=0x0660)
[    2.229398] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[    2.232657] AVX2 version of gcm_enc/dec engaged.
[    2.233042] AES CTR mode by8 optimization enabled
[    2.287356] EXT4-fs (vdc1): mounted filesystem with ordered data mode. Opts: (null)
[    2.288381] EXT4-fs (vdb1): mounted filesystem with ordered data mode. Opts: (null)
[    2.293006] systemd-journald[255]: Received client request to flush runtime journal.
[    2.301747] FAT-fs (vda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[    2.304947] systemd-journal[255]: segfault at 7f1513dd9e00 ip 00007f15157e8683 sp 00007fff6feda320 error 25 in libsystemd-shared-247.so[7f151567a000+18d000]
[    2.306036] Code: Unable to access opcode bytes at RIP 0x7f15157e8659.
[    2.307205] systemd[1]: systemd-journal-flush.service: Main process exited, code=exited, status=1/FAILURE
[    2.308002] systemd[1]: systemd-journal-flush.service: Failed with result 'exit-code'.
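For anyone wanting to see whether the faults hit the same code in a library across processes, the faulting instruction pointer can be expressed as an offset into the mapping shown in brackets. This is a minimal sketch: the regular expression and the name `ip_offset` are illustrative, and it assumes the bracketed values are the mapping's start address and size. The result is an offset into that mapping, not necessarily into the file on disk.

```python
import re

# Illustrative pattern: picks out the instruction pointer and the
# "object[base+size]" part of a kernel segfault message.
MAPPING_RE = re.compile(
    r"ip (?P<ip>[0-9a-f]+) .* in (?P<obj>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def ip_offset(line):
    """Return (object, offset of ip within its mapping), or None."""
    match = MAPPING_RE.search(line)
    if match is None:
        return None  # truncated line or not a segfault message
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    size = int(match.group("size"), 16)
    offset = ip - base
    if not 0 <= offset < size:
        return None  # ip lies outside the reported mapping
    return match.group("obj"), offset


# One line copied from the dmesg output in this report.
sample_line = (
    "[ 2.186245] systemctl[302]: segfault at 7f60950bda66 ip 00007f60951e4bee "
    "sp 00007ffd6fc41020 error 25 in libc-2.31.so[7f60950d2000+159000]"
)

if __name__ == "__main__":
    obj, offset = ip_offset(sample_line)
    print(obj, hex(offset))
```

Identical offsets in different processes would suggest the same instruction is faulting each time, while widely scattered offsets would point away from a single bad code path in the library.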


Kind regards,
--
Kamil Wilczek [https://keys.openpgp.org/]
[6C4BE20A90A1DBFB3CBE2947A832BF5A491F9F2A]

