Bug#841171: linux-image-3.16.0-4-amd64: random hangs during systemd-udev-settle
Package: src:linux
Version: 3.16.36-1+deb8u1
Severity: normal
Hi,
On about 5% of boots, the system hangs for several minutes while waiting
for systemd-udev-settle to complete (systemd-udev-settle is pulled in by
lvm2).
The log shows:
Oct 07 11:40:59 grisou-6.nancy.grid5000.fr systemd-udevd[461]: worker [517] /devices/system/cpu/cpu13 timeout; kill it
Oct 07 11:40:59 grisou-6.nancy.grid5000.fr systemd-udevd[461]: seq 3533 '/devices/system/cpu/cpu13' killed
Oct 07 11:40:59 grisou-6.nancy.grid5000.fr systemd-udevd[461]: worker [517] terminated by signal 9 (Killed)
systemd-udev-settle is then marked as failed because it reached its timeout:
# systemctl status systemd-udev-settle.service
● systemd-udev-settle.service - udev Wait for Complete Device Initialization
   Loaded: loaded (/lib/systemd/system/systemd-udev-settle.service; static)
   Active: failed (Result: timeout) since Thu 2016-10-06 12:46:39 CEST; 1min 57s ago
     Docs: man:udev(7)
           man:systemd-udevd.service(8)
  Process: 456 ExecStart=/bin/udevadm settle (code=killed, signal=TERM)
 Main PID: 456 (code=killed, signal=TERM)
It happens on several machines of different models (all Dell, but I am not
sure that is relevant, since all our recent machines are Dell), so a hardware
issue is unlikely.
The bug is fixed in the stretch and unstable kernels. I bisected it and found
that it was fixed by this upstream commit:
commit 6f942a1f264e875c5f3ad6f505d7b500a3e7fa82
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Sep 24 10:18:46 2014 +0200

    locking/mutex: Don't assume TASK_RUNNING

    We're going to make might_sleep() test for TASK_RUNNING, because
    blocking without TASK_RUNNING will destroy the task state by setting
    it to TASK_RUNNING.

    There are a few occasions where its 'valid' to call blocking
    primitives (and mutex_lock in particular) and not have TASK_RUNNING,
    typically such cases are right before we set TASK_RUNNING anyhow.

    Robustify the code by not assuming this; this has the beneficial side
    effect of allowing optional code emission for fixing the above
    might_sleep() false positives.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: tglx@linutronix.de
    Cc: ilya.dryomov@inktank.com
    Cc: umgwanakikbuti@gmail.com
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index dadbf88..4541951 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -378,8 +378,14 @@ done:
 	 * reschedule now, before we try-lock the mutex. This avoids getting
 	 * scheduled out right after we obtained the mutex.
 	 */
-	if (need_resched())
+	if (need_resched()) {
+		/*
+		 * We _should_ have TASK_RUNNING here, but just in case
+		 * we do not, make it so, otherwise we might get stuck.
+		 */
+		__set_current_state(TASK_RUNNING);
 		schedule_preempt_disabled();
+	}
 
 	return false;
 }
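The comment added by the patch ("otherwise we might get stuck") can be illustrated with a toy user-space model. This is a hypothetical sketch, not kernel code: `toy_task`, `toy_schedule()` and `toy_spin_exit()` are invented names, and `schedule()` is reduced to a single rule, namely that a task calling it while not in TASK_RUNNING is taken off the run queue and sleeps until something wakes it. It also assumes the boot hang goes through this code path, which the bisection suggests but does not prove.

```c
/*
 * Hypothetical user-space model of the need_resched() exit of the
 * mutex spin path, before and after commit 6f942a1f. Not kernel code.
 */
#include <assert.h>
#include <stdbool.h>

/* Simplified task states; names and values are illustrative only. */
enum toy_state { TOY_TASK_RUNNING, TOY_TASK_UNINTERRUPTIBLE };

struct toy_task {
	enum toy_state state;   /* what __set_current_state() would set */
	bool on_runqueue;       /* becomes false once the task goes to sleep */
};

/*
 * Toy schedule(): a TASK_RUNNING task stays runnable (it merely yields);
 * any other state dequeues the task, and nothing in this model wakes it.
 */
void toy_schedule(struct toy_task *t)
{
	assert(t->on_runqueue);
	if (t->state != TOY_TASK_RUNNING)
		t->on_runqueue = false;
}

/*
 * Toy version of the patched hunk. 'fixed' selects the post-commit
 * behaviour that forces TASK_RUNNING before rescheduling.
 * Returns true if the task is still runnable afterwards.
 */
bool toy_spin_exit(struct toy_task *t, bool need_resched, bool fixed)
{
	if (need_resched) {
		if (fixed)
			t->state = TOY_TASK_RUNNING; /* __set_current_state(TASK_RUNNING) */
		toy_schedule(t);
	}
	return t->on_runqueue;
}
```

With a caller left in TOY_TASK_UNINTERRUPTIBLE and `need_resched` true, `toy_spin_exit(&t, true, false)` returns false (the task went to sleep), whereas `toy_spin_exit(&t, true, true)` returns true. In the real kernel such a task can still be woken by whatever event it was preparing to wait for; the model only shows why forcing TASK_RUNNING first removes the dependency on that wakeup.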
Unfortunately, the surrounding code changed after 3.16, which makes a
backport non-trivial.
A workaround on jessie systems is to not install lvm2, if that is an option.
Lucas
-- Package-specific info:
** Version:
Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03)
** Command line:
root=/dev/sda3 console=tty0 console=ttyS0,115200
** Not tainted
** Model information
sys_vendor: Dell Inc.
product_name: PowerEdge R630
product_version:
chassis_vendor: Dell Inc.
chassis_version:
bios_vendor: Dell Inc.
bios_version: 1.3.6
board_vendor: Dell Inc.
board_name: 0CNCJW
board_version: A08
** Loaded modules:
x86_pkg_temp_thermal
intel_powerclamp
ttm
drm_kms_helper
intel_rapl
coretemp
kvm_intel
kvm
crc32_pclmul
aesni_intel
aes_x86_64
lrw
gf128mul
glue_helper
ablk_helper
cryptd
evdev
pcspkr
dcdbas
iTCO_wdt
ipmi_devintf
iTCO_vendor_support
drm
ipmi_si
ipmi_msghandler
mei_me
mei
lpc_ich
shpchp
processor
mfd_core
thermal_sys
wmi
acpi_power_meter
button
autofs4
ext4
crc16
mbcache
jbd2
sg
sd_mod
crc_t10dif
crct10dif_generic
ahci
igb
i2c_algo_bit
ehci_pci
libahci
ixgbe
i2c_core
ehci_hcd
libata
megaraid_sas
dca
crct10dif_pclmul
crct10dif_common
ptp
crc32c_intel
usbcore
pps_core
usb_common
mlx4_core
mdio
scsi_mod
** PCI devices:
not available
** USB devices:
not available
-- System Information:
Debian Release: 8.6
APT prefers stable
APT policy: (500, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 3.16.0-4-amd64 (SMP w/32 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
Versions of packages linux-image-3.16.0-4-amd64 depends on:
ii debconf [debconf-2.0] 1.5.56
ii initramfs-tools [linux-initramfs-tool] 0.120+deb8u2
ii kmod 18-3
ii linux-base 3.5
Versions of packages linux-image-3.16.0-4-amd64 recommends:
pn firmware-linux-free <none>
pn irqbalance <none>
Versions of packages linux-image-3.16.0-4-amd64 suggests:
pn debian-kernel-handbook <none>
ii extlinux 3:6.03+dfsg-5+deb8u1
pn linux-doc-3.16 <none>
Versions of packages linux-image-3.16.0-4-amd64 is related to:
pn firmware-atheros <none>
ii firmware-bnx2 0.43
ii firmware-bnx2x 0.43
pn firmware-brcm80211 <none>
pn firmware-intelwimax <none>
pn firmware-ipw2x00 <none>
pn firmware-ivtv <none>
pn firmware-iwlwifi <none>
pn firmware-libertas <none>
pn firmware-linux <none>
pn firmware-linux-nonfree <none>
pn firmware-myricom <none>
pn firmware-netxen <none>
pn firmware-qlogic <none>
pn firmware-ralink <none>
pn firmware-realtek <none>
pn xen-hypervisor <none>
-- debconf information:
linux-image-3.16.0-4-amd64/postinst/mips-initrd-3.16.0-4-amd64:
linux-image-3.16.0-4-amd64/prerm/removing-running-kernel-3.16.0-4-amd64: true
linux-image-3.16.0-4-amd64/postinst/depmod-error-initrd-3.16.0-4-amd64: false