Bug#741239: task kjournald blocked for more than 120 seconds (very slow I/O, reboot required)
Package: linux-image-3.2.0-4-amd64
Version: 3.2.51-1
I had an image file mounted via a loopback device. A few minutes after
starting an rsync on the mountpoint, some bug was triggered and all I/O
became extremely slow. The kernel logged the following:
Mar 9 11:21:20 st1 kernel: [1702551.406424] INFO: task kjournald:8705
blocked for more than 120 seconds.
Mar 9 11:21:20 st1 kernel: [1702551.406428] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 9 11:21:20 st1 kernel: [1702551.406431] kjournald D
ffff88041fb93780 0 8705 2 0x00000000
Mar 9 11:21:20 st1 kernel: [1702551.406436] ffff880403ff2fe0
0000000000000046 ffffffff00000000 ffff88040f7a4f60
Mar 9 11:21:20 st1 kernel: [1702551.406442] 0000000000013780
ffff880406069fd8 ffff880406069fd8 ffff880403ff2fe0
Mar 9 11:21:20 st1 kernel: [1702551.406446] ffffffff810135d2
0000000181066245 ffff8804048144b0 ffff88041fb93fd0
Mar 9 11:21:20 st1 kernel: [1702551.406451] Call Trace:
Mar 9 11:21:20 st1 kernel: [1702551.406459] [<ffffffff810135d2>] ?
read_tsc+0x5/0x14
Mar 9 11:21:20 st1 kernel: [1702551.406464] [<ffffffff8111d315>] ?
wait_on_buffer+0x28/0x28
Mar 9 11:21:20 st1 kernel: [1702551.406469] [<ffffffff8134e141>] ?
io_schedule+0x59/0x71
Mar 9 11:21:20 st1 kernel: [1702551.406473] [<ffffffff8111d31b>] ?
sleep_on_buffer+0x6/0xa
Mar 9 11:21:20 st1 kernel: [1702551.406476] [<ffffffff8134e584>] ?
__wait_on_bit+0x3e/0x71
Mar 9 11:21:20 st1 kernel: [1702551.406480] [<ffffffff8134e626>] ?
out_of_line_wait_on_bit+0x6f/0x78
Mar 9 11:21:20 st1 kernel: [1702551.406483] [<ffffffff8111d315>] ?
wait_on_buffer+0x28/0x28
Mar 9 11:21:20 st1 kernel: [1702551.406489] [<ffffffff8105fcad>] ?
autoremove_wake_function+0x2a/0x2a
Mar 9 11:21:20 st1 kernel: [1702551.406497] [<ffffffffa05b156c>] ?
journal_commit_transaction+0x912/0xdce [jbd]
Mar 9 11:21:20 st1 kernel: [1702551.406502] [<ffffffff81070fc1>] ?
arch_local_irq_save+0x11/0x17
Mar 9 11:21:20 st1 kernel: [1702551.406506] [<ffffffff8134f209>] ?
_raw_spin_lock_irqsave+0x9/0x25
Mar 9 11:21:20 st1 kernel: [1702551.406510] [<ffffffff8134f247>] ?
_raw_spin_unlock_irqrestore+0xe/0xf
Mar 9 11:21:20 st1 kernel: [1702551.406516] [<ffffffffa05b4643>] ?
kjournald+0xe0/0x21e [jbd]
Mar 9 11:21:20 st1 kernel: [1702551.406521] [<ffffffff8105fc83>] ?
add_wait_queue+0x3c/0x3c
Mar 9 11:21:20 st1 kernel: [1702551.406526] [<ffffffffa05b4563>] ?
commit_timeout+0x5/0x5 [jbd]
Mar 9 11:21:20 st1 kernel: [1702551.406530] [<ffffffff8105f631>] ?
kthread+0x76/0x7e
Mar 9 11:21:20 st1 kernel: [1702551.406535] [<ffffffff81356374>] ?
kernel_thread_helper+0x4/0x10
Mar 9 11:21:20 st1 kernel: [1702551.406540] [<ffffffff8105f5bb>] ?
kthread_worker_fn+0x139/0x139
Mar 9 11:21:20 st1 kernel: [1702551.406544] [<ffffffff81356370>] ?
gs_change+0x13/0x13
Mar 9 11:21:20 st1 kernel: [1702551.406548] INFO: task rsync:4668 blocked
for more than 120 seconds.
Mar 9 11:21:20 st1 kernel: [1702551.406550] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 9 11:21:20 st1 kernel: [1702551.406552] rsync D
ffff88041fb13780 0 4668 8750 0x00000004
Mar 9 11:21:20 st1 kernel: [1702551.406557] ffff880404a9e340
0000000000000086 ffff880100000000 ffff88040f775610
Mar 9 11:21:20 st1 kernel: [1702551.406561] 0000000000013780
ffff880406293fd8 ffff880406293fd8 ffff880404a9e340
Mar 9 11:21:20 st1 kernel: [1702551.406566] 0000000000000246
000000018134f209 ffff88041fdd5c18 ffff8803c84a7978
Mar 9 11:21:20 st1 kernel: [1702551.406570] Call Trace:
Mar 9 11:21:20 st1 kernel: [1702551.406575] [<ffffffffa05b0513>] ?
do_get_write_access+0x1ad/0x367 [jbd]
Mar 9 11:21:20 st1 kernel: [1702551.406580] [<ffffffff8105fcad>] ?
autoremove_wake_function+0x2a/0x2a
Mar 9 11:21:20 st1 kernel: [1702551.406589] [<ffffffffa05df078>] ?
ext3_set_acl+0x87/0x1fc [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406594] [<ffffffffa05b07e5>] ?
journal_get_write_access+0x21/0x38 [jbd]
Mar 9 11:21:20 st1 kernel: [1702551.406600] [<ffffffffa05cac8f>] ?
__ext3_get_inode_loc+0xf4/0x2cf [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406607] [<ffffffffa05dd459>] ?
__ext3_journal_get_write_access+0x1c/0x49 [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406614] [<ffffffffa05df078>] ?
ext3_set_acl+0x87/0x1fc [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406620] [<ffffffffa05cc2ed>] ?
ext3_reserve_inode_write+0x39/0x7b [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406625] [<ffffffffa05af372>] ?
start_this_handle+0x2b8/0x31c [jbd]
Mar 9 11:21:20 st1 kernel: [1702551.406631] [<ffffffffa05cc389>] ?
ext3_mark_inode_dirty+0x5a/0x79 [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406638] [<ffffffffa05df078>] ?
ext3_set_acl+0x87/0x1fc [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406645] [<ffffffffa05df2c8>] ?
ext3_xattr_set_acl+0xdb/0x11d [ext3]
Mar 9 11:21:20 st1 kernel: [1702551.406650] [<ffffffff811130ce>] ?
generic_setxattr+0x69/0x6f
Mar 9 11:21:20 st1 kernel: [1702551.406654] [<ffffffff8111375d>] ?
__vfs_setxattr_noperm+0x66/0xcd
Mar 9 11:21:20 st1 kernel: [1702551.406657] [<ffffffff811138e5>] ?
setxattr+0x95/0xf0
Mar 9 11:21:20 st1 kernel: [1702551.406662] [<ffffffff81036628>] ?
should_resched+0x5/0x23
Mar 9 11:21:20 st1 kernel: [1702551.406666] [<ffffffff81113831>] ?
vfs_setxattr+0x6d/0x8c
Mar 9 11:21:20 st1 kernel: [1702551.406669] [<ffffffff81036628>] ?
should_resched+0x5/0x23
Mar 9 11:21:20 st1 kernel: [1702551.406672] [<ffffffff8111390d>] ?
setxattr+0xbd/0xf0
Mar 9 11:21:20 st1 kernel: [1702551.406678] [<ffffffff810eb124>] ?
kmem_cache_free+0x2d/0x69
Mar 9 11:21:20 st1 kernel: [1702551.406683] [<ffffffff81105d33>] ?
user_path_at_empty+0x53/0x7b
Mar 9 11:21:20 st1 kernel: [1702551.406688] [<ffffffff81113aed>] ?
sys_setxattr+0x6a/0x8f
Mar 9 11:21:20 st1 kernel: [1702551.406692] [<ffffffff81354212>] ?
system_call_fastpath+0x16/0x1b
The process 'rsync' became unkillable. These kernel messages continued for
the next 10 minutes or so with some variations.
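(For reference, a quick way to confirm that a process is stuck in
uninterruptible sleep (D state), which is what makes it unkillable:

  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

The wchan column shows the kernel function each task is sleeping in,
e.g. sleep_on_buffer as in the traces above.)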
Even after the messages stopped, I/O was still extremely slow, freezing up
all the time. By slow/freezing I/O I don't mean just this mountpoint (which
actually seemed totally frozen) but also I/O to/from the other disks.
This box acts as an NFS server. Remote systems kept seeing 'nfs:
server .. not responding', followed a while later by 'nfs: .. OK', and then
'not responding' again.
The exact situation was:
(rsync on..) ext3 -> loopback/DM -> ext4 -> DRBD -> RAID -> disks
I used kpartx to create loopback/DM devices from an .img file stored in
/data (ext4 on DRBD), then mounted /dev/mapper/loop1p1 on /mnt (ext3), and
ran an rsync against this /mnt.
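For clarity, the setup boils down to something like this (the image
filename is a placeholder, the paths are as described above):

  kpartx -av /data/backup.img          # backup.img is a placeholder name;
                                       # creates /dev/mapper/loop1p1 etc.
  mount -t ext3 /dev/mapper/loop1p1 /mnt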
All data had previously been synchronized without ACLs/EAs, and I re-ran
rsync with -AX added. This means the rsync job was mostly synchronizing
ACLs and extended attributes; maybe that is a significant clue.
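The re-run was essentially of this form (the source path is a placeholder
and the exact remaining flags are from memory; the relevant change is the
added -A and -X):

  # -A (--acls) preserves ACLs, -X (--xattrs) preserves extended attributes
  rsync -aAX /source/ /mnt/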
The machine has 16 GB of RAM, of which typically 15 GB is used for
caches/buffers. This was also the case at the time of the hang; I checked
with 'free'.
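For the record, the check was simply:

  free -m

which showed nearly all memory in the buffers/cached columns rather than
in use by applications.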
Some possibly relevant vm sysctl settings:
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 0
vm.dirty_bytes = 15000000
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
As you can see, vm.dirty_bytes is very low (just 15 MB), which rules out
kjournald blocking on a very large backlog of dirty pages.
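If it helps with debugging, these can be inspected and changed at runtime
with sysctl. Note that vm.dirty_bytes and vm.dirty_ratio are mutually
exclusive: setting one makes the kernel report the other as 0, which is why
vm.dirty_ratio reads 0 above. The 64 MB value below is just an arbitrary
example, not something I have tested:

  sysctl vm.dirty_bytes vm.dirty_ratio
  sysctl -w vm.dirty_bytes=67108864    # example: raise the hard limit to 64 MB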
Hopefully this report will help. Of course, if this is a known issue and
there's a solution available then I'd be eager to hear about it.
If you need any additional details or if I misfiled this bug, let me know.
ii  linux-image-3.2.0-4-amd64  3.2.51-1  amd64  Linux 3.2 for 64-bit PCs