[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#1034676: linux-image-5.10.0-20-rt-amd64: RT-Kernel freeze during IO load



Control: tags -1 + moreinfo

Hi Christian,

On Fri, Apr 21, 2023 at 02:21:19PM +0200, Christian Barth wrote:
> Subject: linux-image-5.10.0-20-rt-amd64: RT-Kernel freeze during IO load
> Package: linux-image-5.10.0-20-rt-amd64
> X-Debbugs-Cc: C.Barth@goepel.com
> Version: 5.10.158-2
> Severity: normal
> 
> Dear Maintainer,
> 
> I have a problem with an application that leads to a
> possible kernel freeze.
> We tried to split the problem, but unfortunately we did not succeed.
> 
> We were able to recreate a system freeze after about 3-15h using simple
> echo
> transfers over 6 PCIe cards. We ran our PCIe driver without the API part
> and
> with a simple echo test several times with no problem for more than 60h.
> On the other hand, we tested the API part by simulating the interface to
> the
> driver using predefined responses.
> 
> As you can see in the logs the first error seems to be always the same and
> has to do with messagequeues (our server application uses mq as ipc).
> Sometimes seconds or hours later there are more errors like in the 2nd
> log.
> 
> Kind regards,
> Christian
> 
> 
> [42546.217248] WARNING: CPU: 5 PID: 1717 at kernel/sched/core.c:2500
> set_task_cpu+0x1cd/0x1e0
> [42546.217255] Modules linked in: udp_diag tcp_diag inet_diag Pxi62xx(OE)
> rfkill binfmt_misc intel_rapl_msr intel_rapl_common joydev
> snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp dell_smm_hwmon
> snd_hda_codec_realtek coretemp snd_hda_codec_generic ledtrig_audio
> kvm_intel kvm irqbypass dell_wmi ghash_clmulni_intel snd_hda_intel
> snd_intel_dspcfg aesni_intel soundwire_intel libaes
> soundwire_generic_allocation crypto_simd cryptd glue_helper snd_soc_core
> rapl snd_compress soundwire_cadence intel_cstate snd_hda_codec dell_smbios
> snd_hda_core dcdbas snd_hwdep intel_uncore sparse_keymap soundwire_bus
> dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt mei_hdcp mei_wdt
> iTCO_wdt snd_pcm intel_pmc_bxt ee1004 snd_timer iTCO_vendor_support
> serio_raw pcspkr sg watchdog snd soundcore mei_me mei intel_pch_thermal
> evdev intel_pmc_core acpi_pad parport_pc ppdev lp parport fuse configfs
> ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic
> raid10 raid456 async_raid6_recov
> [42546.217314]  async_memcpy async_pq async_xor async_tx xor raid6_pq
> libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_microsoft
> ff_memless hid_generic sd_mod t10_pi crc_t10dif crct10dif_generic usbhid
> hid i915 ahci i2c_algo_bit xhci_pci drm_kms_helper libahci xhci_hcd
> crct10dif_pclmul crct10dif_common cec e1000e libata crc32_pclmul
> crc32c_intel psmouse i2c_i801 ptp usbcore drm pps_core scsi_mod i2c_smbus
> usb_common wmi fan video button
> [42546.217346] CPU: 5 PID: 1717 Comm: EchoTest Tainted: G           OE
> 5.10.0-21-rt-amd64 #1 Debian 5.10.162-1
> [42546.217349] Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS
> 2.18.0 06/10/2021
> [42546.217350] RIP: 0010:set_task_cpu+0x1cd/0x1e0
> [42546.217353] Code: c9 0f 84 a0 fe ff ff 8b 43 68 83 f8 02 0f 84 94 fe ff
> ff 0f 0b e9 8d fe ff ff 0f 0b 66 83 bb c0 07 00 00 00 0f 84 9c fe ff ff
> <0f> 0b e9 95 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44
> [42546.217356] RSP: 0018:ffffb96002873d48 EFLAGS: 00010002
> [42546.217358] RAX: 0000000000000000 RBX: ffff9245595d8000 RCX:
> 0000000000000000
> [42546.217359] RDX: ffff924c0dcf1278 RSI: 0000000000000002 RDI:
> ffff9245595d8000
> [42546.217361] RBP: 0000000000000002 R08: ffff924c0dcf1278 R09:
> 0000000000000000
> [42546.217362] R10: 0000000000000001 R11: 0000000000000000 R12:
> 0000000000000002
> [42546.217363] R13: ffff9245595d8028 R14: 0000000000000000 R15:
> ffff924c0dcf09c0
> [42546.217365] FS:  00007f97fa35b740(0000) GS:ffff924c0dd40000(0000)
> knlGS:0000000000000000
> [42546.217367] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [42546.217369] CR2: 00007fcbfc13d000 CR3: 0000000138674005 CR4:
> 00000000003706e0
> [42546.217370] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [42546.217371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [42546.217373] Call Trace:
> [42546.217377]  push_rt_task.part.0+0x1bf/0x420
> [42546.217382]  task_woken_rt+0x5d/0x70
> [42546.217385]  ttwu_do_wakeup+0x45/0x190
> [42546.217388]  try_to_wake_up+0x19b/0x6b0
> [42546.217391]  __wake_up_q+0x99/0xd0
> [42546.217395]  do_mq_timedsend+0x23b/0x410
> [42546.217400]  __x64_sys_mq_timedsend+0x70/0xb0
> [42546.217403]  ? exit_to_user_mode_prepare+0x33/0x1a0
> [42546.217407]  do_syscall_64+0x30/0x80
> [42546.217410]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [42546.217414] RIP: 0033:0x7f97fa698040
> [42546.217417] Code: 89 4c 24 14 e8 61 04 00 00 4c 8b 44 24 18 44 8b 54 24
> 14 41 89 c1 48 8b 54 24 08 48 8b 34 24 b8 f2 00 00 00 8b 7c 24 10 0f 05
> <48> 3d 00 f0 ff ff 77 30 44 89 cf 89 44 24 10 e8 8c 04 00 00 8b 44
> [42546.217418] RSP: 002b:00007ffc369eeb90 EFLAGS: 00000293 ORIG_RAX:
> 00000000000000f2
> [42546.217421] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007f97fa698040
> [42546.217422] RDX: 000000000000003f RSI: 000055b54a7a31c8 RDI:
> 0000000000000008
> [42546.217424] RBP: 0000000000000000 R08: 0000000000000000 R09:
> 0000000000000000
> [42546.217425] R10: 0000000000000000 R11: 0000000000000293 R12:
> 0000000000000007
> [42546.217426] R13: 000055b54a7b4230 R14: 000000000000002b R15:
> 00007f97fa9167c0
> [42546.217430] ---[ end trace 0000000000000002 ]---
> [42547.207213] sched: RT throttling activated
> 
> 
> [23686.539555] WARNING: CPU: 6 PID: 1811 at kernel/sched/core.c:2500
> set_task_cpu+0x1cd/0x1e0
> [23686.539563] Modules linked in: Pxi62xx(OE) rfkill binfmt_misc joydev
> intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal dell_smm_hwmon
> snd_hda_codec_hdmi intel_powerclamp coretemp snd_hda_codec_realtek
> snd_hda_codec_generic ledtrig_audio kvm_intel kvm irqbypass dell_wmi
> ghash_clmulni_intel mei_hdcp mei_wdt aesni_intel snd_hda_intel libaes
> snd_intel_dspcfg crypto_simd cryptd soundwire_intel glue_helper
> soundwire_generic_allocation rapl snd_soc_core snd_compress intel_cstate
> soundwire_cadence dell_smbios dcdbas snd_hda_codec iTCO_wdt serio_raw
> sparse_keymap dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt
> intel_uncore snd_hda_core intel_pmc_bxt snd_hwdep iTCO_vendor_support
> ee1004 pcspkr sg watchdog soundwire_bus snd_pcm snd_timer snd mei_me
> soundcore mei intel_pch_thermal evdev acpi_pad intel_pmc_core parport_pc
> ppdev lp parport fuse configfs ip_tables x_tables autofs4 ext4 crc16
> mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov
> async_memcpy async_pq async_xor
> [23686.539604]  async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0
> multipath linear md_mod hid_microsoft ff_memless hid_generic sd_mod t10_pi
> crc_t10dif crct10dif_generic usbhid hid i915 i2c_algo_bit xhci_pci
> drm_kms_helper ahci xhci_hcd libahci cec libata crct10dif_pclmul
> crct10dif_common crc32_pclmul e1000e usbcore crc32c_intel i2c_i801 ptp drm
> psmouse scsi_mod pps_core i2c_smbus usb_common wmi fan video button
> [23686.539625] CPU: 6 PID: 1811 Comm: EchoTest Tainted: G           OE
> 5.10.0-21-rt-amd64 #1 Debian 5.10.162-1
> [23686.539627] Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS
> 2.18.0 06/10/2021
> [23686.539628] RIP: 0010:set_task_cpu+0x1cd/0x1e0
> [23686.539630] Code: c9 0f 84 a0 fe ff ff 8b 43 68 83 f8 02 0f 84 94 fe ff
> ff 0f 0b e9 8d fe ff ff 0f 0b 66 83 bb c0 07 00 00 00 0f 84 9c fe ff ff
> <0f> 0b e9 95 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44
> [23686.539632] RSP: 0018:ffffa0f3c2467d48 EFLAGS: 00010002
> [23686.539634] RAX: 0000000000000000 RBX: ffff94b28ed6e400 RCX:
> 0000000000000000
> [23686.539635] RDX: ffff94b68dcf1278 RSI: 0000000000000000 RDI:
> ffff94b28ed6e400
> [23686.539636] RBP: 0000000000000000 R08: ffff94b68dcf1278 R09:
> 0000000000000000
> [23686.539637] R10: 0000000000000001 R11: 0000000000000000 R12:
> 0000000000000000
> [23686.539638] R13: ffff94b28ed6e428 R14: 0000000000000000 R15:
> ffff94b68dcf09c0
> [23686.539639] FS:  00007f4d28db7740(0000) GS:ffff94b68dd80000(0000)
> knlGS:0000000000000000
> [23686.539640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [23686.539641] CR2: 00007f89d826c000 CR3: 0000000189bd2002 CR4:
> 00000000003706e0
> [23686.539642] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [23686.539643] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [23686.539644] Call Trace:
> [23686.539647]  push_rt_task.part.0+0x1bf/0x420
> [23686.539651]  task_woken_rt+0x5d/0x70
> [23686.539652]  ttwu_do_wakeup+0x45/0x190
> [23686.539655]  try_to_wake_up+0x19b/0x6b0
> [23686.539657]  __wake_up_q+0x99/0xd0
> [23686.539659]  do_mq_timedsend+0x23b/0x410
> [23686.539662]  ? pick_next_task_fair+0x182/0x3d0
> [23686.539664]  ? _raw_spin_unlock_irq+0x1d/0x60
> [23686.539667]  ? __schedule+0x824/0x980
> [23686.539670]  __x64_sys_mq_timedsend+0x70/0xb0
> [23686.539672]  do_syscall_64+0x30/0x80
> [23686.539674]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [23686.539677] RIP: 0033:0x7f4d290f4040
> [23686.539679] Code: 89 4c 24 14 e8 61 04 00 00 4c 8b 44 24 18 44 8b 54 24
> 14 41 89 c1 48 8b 54 24 08 48 8b 34 24 b8 f2 00 00 00 8b 7c 24 10 0f 05
> <48> 3d 00 f0 ff ff 77 30 44 89 cf 89 44 24 10 e8 8c 04 00 00 8b 44
> [23686.539680] RSP: 002b:00007fffbee9b040 EFLAGS: 00000293 ORIG_RAX:
> 00000000000000f2
> [23686.539682] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007f4d290f4040
> [23686.539683] RDX: 0000000000000021 RSI: 000055e8bffa81c8 RDI:
> 0000000000000008
> [23686.539683] RBP: 0000000000000000 R08: 0000000000000000 R09:
> 0000000000000000
> [23686.539684] R10: 0000000000000000 R11: 0000000000000293 R12:
> 0000000000000007
> [23686.539685] R13: 000055e8bffb9230 R14: 000000000000000d R15:
> 00007f4d293727c0
> [23686.539689] ---[ end trace 0000000000000002 ]---
> [23687.529423] sched: RT throttling activated
> [23687.529423] sched: RT throttling activated
> [23687.529438] ------------[ cut here ]------------
> [23687.529440] DEBUG_LOCKS_WARN_ON(l->nestcnt == 0)
> [23687.529448] WARNING: CPU: 0 PID: 1825 at
> include/linux/local_lock_internal.h:67 refill_obj_stock+0x1ca/0x260
> [23687.529452] Modules linked in: Pxi62xx(OE) rfkill binfmt_misc joydev
> intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal dell_smm_hwmon
> snd_hda_codec_hdmi intel_powerclamp coretemp snd_hda_codec_realtek
> snd_hda_codec_generic ledtrig_audio kvm_intel kvm irqbypass dell_wmi
> ghash_clmulni_intel mei_hdcp mei_wdt aesni_intel snd_hda_intel libaes
> snd_intel_dspcfg crypto_simd cryptd soundwire_intel glue_helper
> soundwire_generic_allocation rapl snd_soc_core snd_compress intel_cstate
> soundwire_cadence dell_smbios dcdbas snd_hda_codec iTCO_wdt serio_raw
> sparse_keymap dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt
> intel_uncore snd_hda_core intel_pmc_bxt snd_hwdep iTCO_vendor_support
> ee1004 pcspkr sg watchdog soundwire_bus snd_pcm snd_timer snd mei_me
> soundcore mei intel_pch_thermal evdev acpi_pad intel_pmc_core parport_pc
> ppdev lp parport fuse configfs ip_tables x_tables autofs4 ext4 crc16
> mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov
> async_memcpy async_pq async_xor
> [23687.529504]  async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0
> multipath linear md_mod hid_microsoft ff_memless hid_generic sd_mod t10_pi
> crc_t10dif crct10dif_generic usbhid hid i915 i2c_algo_bit xhci_pci
> drm_kms_helper ahci xhci_hcd libahci cec libata crct10dif_pclmul
> crct10dif_common crc32_pclmul e1000e usbcore crc32c_intel i2c_i801 ptp drm
> psmouse scsi_mod pps_core i2c_smbus usb_common wmi fan video button
> [23687.529538] CPU: 0 PID: 1825 Comm: Srv-RequestT Tainted: G        W  OE
>     5.10.0-21-rt-amd64 #1 Debian 5.10.162-1
> [23687.529540] Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS
> 2.18.0 06/10/2021
> [23687.529541] RIP: 0010:refill_obj_stock+0x1ca/0x260
> [23687.529544] Code: 07 ff ff ff e8 77 bd 1b 00 85 c0 74 1f 8b 0d b5 07 61
> 01 85 c9 75 15 48 c7 c6 06 42 ad 89 48 c7 c7 e2 f1 ab 89 e8 eb f0 60 00
> <0f> 0b 8b 15 f6 57 cb 01 85 d2 0f 84 1a ff ff ff e9 28 ff ff ff e8
> [23687.529547] RSP: 0018:ffffa0f3c255fdd0 EFLAGS: 00010286
> [23687.529549] RAX: 0000000000000000 RBX: ffff94b68dcedd60 RCX:
> 0000000000000000
> [23687.529550] RDX: 0000000000000001 RSI: ffffffff89b10c0c RDI:
> 00000000ffffffff
> [23687.529552] RBP: ffff94b68dc2dd60 R08: ffffffff8a063d00 R09:
> 000000008a992eeb
> [23687.529553] R10: ffffffffffffffff R11: 0000000000000001 R12:
> ffff94b68dc2dd60
> [23687.529553] R13: 00000000000017d0 R14: ffff94afa89714c0 R15:
> ffff94af80043a00
> [23687.529555] FS:  00007f2cba7fc700(0000) GS:ffff94b68dc00000(0000)
> knlGS:0000000000000000
> [23687.529556] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [23687.529557] CR2: 00007fcbd7797000 CR3: 00000001aa754004 CR4:
> 00000000003706f0
> [23687.529560] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [23687.529561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [23687.529562] Call Trace:
> [23687.529564]  ? free_msg+0x1b/0x40
> [23687.529566]  kfree+0x382/0x500
> [23687.529572]  free_msg+0x1b/0x40
> [23687.529574]  do_mq_timedreceive+0x370/0x550
> [23687.529578]  __x64_sys_mq_timedreceive+0x70/0xb0
> [23687.529581]  do_syscall_64+0x30/0x80
> [23687.529584]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [23687.529588] RIP: 0033:0x7f2cd31ce0f3
> [23687.529592] Code: 4c 24 20 e8 af 03 00 00 4c 8b 44 24 28 4c 8b 54 24 20
> 41 89 c1 48 8b 54 24 18 48 8b 74 24 10 b8 f3 00 00 00 8b 7c 24 08 0f 05
> <48> 3d 00 f0 ff ff 77 35 44 89 cf 48 89 44 24 08 e8 d8 03 00 00 48
> [23687.529593] RSP: 002b:00007f2cba7fbdd0 EFLAGS: 00000293 ORIG_RAX:
> 00000000000000f3
> [23687.529595] RAX: ffffffffffffffda RBX: 000000000000000a RCX:
> 00007f2cd31ce0f3
> [23687.529596] RDX: 000000000000800c RSI: 00007f2cc400a530 RDI:
> 000000000000001c
> [23687.529597] RBP: 00007f2cc400a530 R08: 00007f2cba7fbe10 R09:
> 0000000000000000
> [23687.529598] R10: 0000000000000000 R11: 0000000000000293 R12:
> 000000000000800c
> [23687.529599] R13: 000000000000001c R14: 0000000000000000 R15:
> 0000000000000000
> [23687.529603] ---[ end trace 0000000000000003 ]---

If I see it correctly you have as well an out of tree module loaded,
involved. Can you reproduce the issue without that beeing loaded in a
repsective test setup?

5.10.158-2 is not the newest kernel in bullseye, can you reproduce the
issue with 5.10.162-1?

Regards,
Salvatore

The next weekend's point release aims to include as well a rebase to
5.10.178-1.


Reply to: