
QEMU-KVM VMs sometimes freeze when I run them for a couple of days



Hi Debian people ;-),

After having some issues with Fedora last year, I decided to reinstall all my servers with Debian 10. I'm super happy with Debian, except for one recurring issue I have with QEMU-KVM hosts that is very difficult to reproduce, so I would like to discuss it first before I open a new bug. Could you please discuss it with me? ;-)

I noticed that when I run VMs for a long period of time (a couple of days), one or more VMs quite often get stuck. It is not possible to connect to the stuck VMs using virt-manager, and their serial consoles don't respond.

It is not possible to shut them down ("virsh shutdown vm"). Sometimes the stuck VMs can be powered off ("virsh destroy vm"), but in most cases "virsh destroy" doesn't work either. In that case the only thing left to do is to shut down the rest of the running VMs (which do respond) and reboot the host.
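
In case it helps with debugging, this is roughly the state I can collect from a stuck domain (just a sketch; the domain name "vm" and the <tid> are placeholders, and the pidfile path assumes a stock Debian 10 libvirt):
~~~
# What does libvirt think the domain state is?
virsh domstate vm --reason

# Find the QEMU process and check whether its threads are in D state
pid=$(cat /run/libvirt/qemu/vm.pid)
ps -L -o pid,tid,stat,wchan:32,comm -p "$pid"

# Kernel stack of a blocked vCPU thread (as root; take <tid> from ps above)
cat /proc/$pid/task/<tid>/stack
~~~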

This is the kernel message I get on the console telling me that one or more VMs are stuck:
~~~
[686811.010084] INFO: task CPU 0/KVM:12193 blocked for more than 120 seconds.
[686811.017040]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.024777] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.033012] INFO: task CPU 1/KVM:12194 blocked for more than 120 seconds.
[686811.039921]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.047606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.055803] INFO: task worker:5952 blocked for more than 120 seconds.
[686811.062355]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.070048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.078059] INFO: task worker:7667 blocked for more than 120 seconds.
[686811.084618]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.092306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.100296] INFO: task worker:4903 blocked for more than 120 seconds.
[686811.106849]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.114530] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.122512] INFO: task worker:4905 blocked for more than 120 seconds.
[686811.129068]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.136765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.144771] INFO: task worker:4920 blocked for more than 120 seconds.
[686811.151328]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.159009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686811.167009] INFO: task worker:7328 blocked for more than 120 seconds.
[686811.173576]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686811.181256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686931.842071] INFO: task CPU 0/KVM:12193 blocked for more than 120 seconds.
[686931.849028]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686931.856764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[686931.864997] INFO: task CPU 1/KVM:12194 blocked for more than 120 seconds.
[686931.871908]       Tainted: P           OE     4.19.0-12-amd64 #1 Debian 4.19.152-1
[686931.879586] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
~~~
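
If it happens again, the full stack traces of the blocked tasks would probably say more than these one-liners. As far as I know, the generic way to get them is via magic SysRq (nothing Debian-specific here):
~~~
# Temporarily enable all SysRq functions
echo 1 > /proc/sys/kernel/sysrq

# Dump stack traces of all uninterruptible (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
~~~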

When I reboot/shut down the host, the reboot/shutdown takes approx. 30 minutes.

This is what it looks like during the reboot/shutdown:
~~~
...
[1051413.325604] libvirt-guests.sh[10107]: error: Failed to shutdown domain de763fd3-043c-4f6f-b7f9-e134907b9f54
[1051413.325964] libvirt-guests.sh[10107]: error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainShutdown agent=remoteDispatchDomainShutdown)
...
[1053290.120617] reboot: Power down
~~~
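
If the long shutdown is libvirt-guests waiting on the stuck domains, the timeout is configurable on Debian in /etc/default/libvirt-guests; something like this should at least cap the wait (values are just an example):
~~~
# /etc/default/libvirt-guests
ON_SHUTDOWN=shutdown    # try a clean guest shutdown first
SHUTDOWN_TIMEOUT=120    # give up after 120 s per guest instead of waiting much longer
~~~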

Or systemd waits for the LUKS-encrypted block storage, which doesn't make sense since LUKS is used only on the SSD with the host OS, and the VMs run from different SSDs.
~~~
[  OK  ] Stopped target Local File Systems (Pre).
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
         Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[  OK  ] Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[FAILED] Failed deactivating swap swp.
         Stopping Cryptography Setup for swapluks1...
[  OK  ] Stopped Cryptography Setup for swapluks1.
         Stopping Load/Save Random Seed...
[  OK  ] Removed slice system-systemd\x2dcryptsetup.slice.
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Reached target Shutdown.
[   ***] (1 of 4) A stop job is running for /dev/dm-1 (18min 6s / no limit)
~~~
and then
~~~
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Reached target Shutdown.
[ TIME ] Timed out starting Reboot.
[  !!  ] Forcibly rebooting: job timed out
[415391.907610] watchdog: watchdog0: watchdog did not stop!
[415578.747792] reboot: Restarting system
~~~
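
The stop job with "no limit" can probably also be capped so the host at least reboots in bounded time; a sketch (the unit name is the systemd escaping of /dev/dm-1 from the stop job above, and the timeout value is arbitrary):
~~~
# Create a drop-in for the device unit that hangs
# (\x2d is systemd's escaping of "-" in /dev/dm-1)
systemctl edit 'dev-dm\x2d1.device'

# Add in the editor:
#   [Unit]
#   JobTimeoutSec=300
~~~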

As I mentioned, it is very difficult to reproduce since it takes days to get into that situation. The VMs most likely to get stuck are those that:

a) have larger virtual disks
b) make more intensive use of storage (more IOPS)
c) have more vCPUs

The problem is that VMs with larger disks usually also use more IOPS and have more vCPUs, so it is difficult to say what exactly triggers the issue. Based on my testing I think fewer vCPUs make a VM less likely to get stuck, but it's difficult to say...
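
To turn "uses more IOPS" into actual numbers, something like this could run on the host and log per-domain block I/O over time (a rough sketch; the log path is arbitrary):
~~~
# Log block I/O counters for all running domains once a minute;
# "virsh domstats --block" prints rd/wr request and byte counters per disk.
while sleep 60; do
    date -Is
    virsh domstats --block
done >> /var/log/domstats.log
~~~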

The only thing I'm confident about is that the problem is not hardware related - it happened both on my Supermicro with a Xeon E5 v2 and on other hardware with a 7th-gen Intel i7.

From the configuration/software perspective, all hosts and all VMs run up-to-date Debian 10. The only less usual piece of software is ZFS 0.8 from buster-backports. All VMs use ZFS volumes (similar to raw disks).

I'm super confident using ZFS since I've been using it since version 0.7 absolutely everywhere, including on this Debian 10 HP laptop that I'm using to write this text.

When some of the VMs get stuck/freeze, the other VMs run just fine, and ZFS stays functional and fast. I don't see any errors (in "zpool status" after I run "zpool scrub"), so I don't think it's ZFS, but I want to mention it since it's the only unusual piece of software/configuration I use.
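
For completeness, these are the ZFS checks I mean (pool and volume names are placeholders):
~~~
# Pool health, including per-device read/write/checksum error counters
zpool status -v

# Start a scrub, then check the results once it finishes
zpool scrub tank
zpool status tank

# zvol properties that influence the I/O behaviour of a VM disk
zfs get volblocksize,compression,sync,primarycache tank/vm-disk
~~~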

Btw. this has never happened on my laptop, which has the same configuration as the servers (+ desktop environment), but I reboot it multiple times a week, so that might be the answer...

Please let me know your thoughts. Thank you ;-).

Merry Christmas and Happy New Year ;-).

Kind regards,

Robert Hrach

