
Bug#741239: task kjournald blocked for more than 120 seconds (very slow I/O, reboot required)



Package: linux-image-3.2.0-4-amd64
Version: 3.2.51-1

I had an image file mounted via loopback. A few minutes after starting an
rsync on the mountpoint, some bug was triggered and all I/O became extremely
slow after the kernel logged this:

Mar  9 11:21:20 st1 kernel: [1702551.406424] INFO: task kjournald:8705
blocked for more than 120 seconds.
Mar  9 11:21:20 st1 kernel: [1702551.406428] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  9 11:21:20 st1 kernel: [1702551.406431] kjournald       D
ffff88041fb93780     0  8705      2 0x00000000
Mar  9 11:21:20 st1 kernel: [1702551.406436]  ffff880403ff2fe0
0000000000000046 ffffffff00000000 ffff88040f7a4f60
Mar  9 11:21:20 st1 kernel: [1702551.406442]  0000000000013780
ffff880406069fd8 ffff880406069fd8 ffff880403ff2fe0
Mar  9 11:21:20 st1 kernel: [1702551.406446]  ffffffff810135d2
0000000181066245 ffff8804048144b0 ffff88041fb93fd0
Mar  9 11:21:20 st1 kernel: [1702551.406451] Call Trace:
Mar  9 11:21:20 st1 kernel: [1702551.406459]  [<ffffffff810135d2>] ?
read_tsc+0x5/0x14
Mar  9 11:21:20 st1 kernel: [1702551.406464]  [<ffffffff8111d315>] ?
wait_on_buffer+0x28/0x28
Mar  9 11:21:20 st1 kernel: [1702551.406469]  [<ffffffff8134e141>] ?
io_schedule+0x59/0x71
Mar  9 11:21:20 st1 kernel: [1702551.406473]  [<ffffffff8111d31b>] ?
sleep_on_buffer+0x6/0xa
Mar  9 11:21:20 st1 kernel: [1702551.406476]  [<ffffffff8134e584>] ?
__wait_on_bit+0x3e/0x71
Mar  9 11:21:20 st1 kernel: [1702551.406480]  [<ffffffff8134e626>] ?
out_of_line_wait_on_bit+0x6f/0x78
Mar  9 11:21:20 st1 kernel: [1702551.406483]  [<ffffffff8111d315>] ?
wait_on_buffer+0x28/0x28
Mar  9 11:21:20 st1 kernel: [1702551.406489]  [<ffffffff8105fcad>] ?
autoremove_wake_function+0x2a/0x2a
Mar  9 11:21:20 st1 kernel: [1702551.406497]  [<ffffffffa05b156c>] ?
journal_commit_transaction+0x912/0xdce [jbd]
Mar  9 11:21:20 st1 kernel: [1702551.406502]  [<ffffffff81070fc1>] ?
arch_local_irq_save+0x11/0x17
Mar  9 11:21:20 st1 kernel: [1702551.406506]  [<ffffffff8134f209>] ?
_raw_spin_lock_irqsave+0x9/0x25
Mar  9 11:21:20 st1 kernel: [1702551.406510]  [<ffffffff8134f247>] ?
_raw_spin_unlock_irqrestore+0xe/0xf
Mar  9 11:21:20 st1 kernel: [1702551.406516]  [<ffffffffa05b4643>] ?
kjournald+0xe0/0x21e [jbd]
Mar  9 11:21:20 st1 kernel: [1702551.406521]  [<ffffffff8105fc83>] ?
add_wait_queue+0x3c/0x3c
Mar  9 11:21:20 st1 kernel: [1702551.406526]  [<ffffffffa05b4563>] ?
commit_timeout+0x5/0x5 [jbd]
Mar  9 11:21:20 st1 kernel: [1702551.406530]  [<ffffffff8105f631>] ?
kthread+0x76/0x7e
Mar  9 11:21:20 st1 kernel: [1702551.406535]  [<ffffffff81356374>] ?
kernel_thread_helper+0x4/0x10
Mar  9 11:21:20 st1 kernel: [1702551.406540]  [<ffffffff8105f5bb>] ?
kthread_worker_fn+0x139/0x139
Mar  9 11:21:20 st1 kernel: [1702551.406544]  [<ffffffff81356370>] ?
gs_change+0x13/0x13
Mar  9 11:21:20 st1 kernel: [1702551.406548] INFO: task rsync:4668 blocked
for more than 120 seconds.
Mar  9 11:21:20 st1 kernel: [1702551.406550] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  9 11:21:20 st1 kernel: [1702551.406552] rsync           D
ffff88041fb13780     0  4668   8750 0x00000004
Mar  9 11:21:20 st1 kernel: [1702551.406557]  ffff880404a9e340
0000000000000086 ffff880100000000 ffff88040f775610
Mar  9 11:21:20 st1 kernel: [1702551.406561]  0000000000013780
ffff880406293fd8 ffff880406293fd8 ffff880404a9e340
Mar  9 11:21:20 st1 kernel: [1702551.406566]  0000000000000246
000000018134f209 ffff88041fdd5c18 ffff8803c84a7978
Mar  9 11:21:20 st1 kernel: [1702551.406570] Call Trace:
Mar  9 11:21:20 st1 kernel: [1702551.406575]  [<ffffffffa05b0513>] ?
do_get_write_access+0x1ad/0x367 [jbd]
Mar  9 11:21:20 st1 kernel: [1702551.406580]  [<ffffffff8105fcad>] ?
autoremove_wake_function+0x2a/0x2a
Mar  9 11:21:20 st1 kernel: [1702551.406589]  [<ffffffffa05df078>] ?
ext3_set_acl+0x87/0x1fc [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406594]  [<ffffffffa05b07e5>] ?
journal_get_write_access+0x21/0x38 [jbd]
Mar  9 11:21:20 st1 kernel: [1702551.406600]  [<ffffffffa05cac8f>] ?
__ext3_get_inode_loc+0xf4/0x2cf [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406607]  [<ffffffffa05dd459>] ?
__ext3_journal_get_write_access+0x1c/0x49 [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406614]  [<ffffffffa05df078>] ?
ext3_set_acl+0x87/0x1fc [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406620]  [<ffffffffa05cc2ed>] ?
ext3_reserve_inode_write+0x39/0x7b [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406625]  [<ffffffffa05af372>] ?
start_this_handle+0x2b8/0x31c [jbd]
Mar  9 11:21:20 st1 kernel: [1702551.406631]  [<ffffffffa05cc389>] ?
ext3_mark_inode_dirty+0x5a/0x79 [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406638]  [<ffffffffa05df078>] ?
ext3_set_acl+0x87/0x1fc [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406645]  [<ffffffffa05df2c8>] ?
ext3_xattr_set_acl+0xdb/0x11d [ext3]
Mar  9 11:21:20 st1 kernel: [1702551.406650]  [<ffffffff811130ce>] ?
generic_setxattr+0x69/0x6f
Mar  9 11:21:20 st1 kernel: [1702551.406654]  [<ffffffff8111375d>] ?
__vfs_setxattr_noperm+0x66/0xcd
Mar  9 11:21:20 st1 kernel: [1702551.406657]  [<ffffffff811138e5>] ?
setxattr+0x95/0xf0
Mar  9 11:21:20 st1 kernel: [1702551.406662]  [<ffffffff81036628>] ?
should_resched+0x5/0x23
Mar  9 11:21:20 st1 kernel: [1702551.406666]  [<ffffffff81113831>] ?
vfs_setxattr+0x6d/0x8c
Mar  9 11:21:20 st1 kernel: [1702551.406669]  [<ffffffff81036628>] ?
should_resched+0x5/0x23
Mar  9 11:21:20 st1 kernel: [1702551.406672]  [<ffffffff8111390d>] ?
setxattr+0xbd/0xf0
Mar  9 11:21:20 st1 kernel: [1702551.406678]  [<ffffffff810eb124>] ?
kmem_cache_free+0x2d/0x69
Mar  9 11:21:20 st1 kernel: [1702551.406683]  [<ffffffff81105d33>] ?
user_path_at_empty+0x53/0x7b
Mar  9 11:21:20 st1 kernel: [1702551.406688]  [<ffffffff81113aed>] ?
sys_setxattr+0x6a/0x8f
Mar  9 11:21:20 st1 kernel: [1702551.406692]  [<ffffffff81354212>] ?
system_call_fastpath+0x16/0x1b

The 'rsync' process became unkillable. These kernel messages continued for
the next 10 minutes or so, with some variations.
Even after the messages stopped, I/O was still extremely slow, freezing up
all the time. By slow/freezing I/O I don't mean just I/O to this mountpoint
(which actually seemed totally frozen) but also I/O to/from the other disks.
This box is acting as an NFS server. Remote systems saw a continuing 'nfs:
server .. not responding', followed by 'nfs: .. OK' a while later, and then
not responding again.

The exact situation was:
(rsync on..) ext3 -> loopback/DM -> ext4 -> DRBD -> RAID -> disks

I used kpartx to create loopback/DM devices from an .img file stored in
/data (ext4 on DRBD), then mounted /dev/mapper/loop1p1 on /mnt (ext3), and
ran rsync on this /mnt.
All data had previously been synchronized without ACLs/EAs, and I re-ran
rsync with -AX added. This means the rsync job was mostly synchronizing ACLs
and extended attributes, which may be a significant clue.
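
For reference, the setup amounted to roughly the following sketch (the image
path and rsync source here are placeholders I made up; the device name and
mount points are as described above):

  # /data is ext4 on DRBD; the image file lives there
  kpartx -a -v /data/disk.img        # creates /dev/mapper/loopNpN mappings
  mount /dev/mapper/loop1p1 /mnt    # partition 1 contains ext3
  # data was already in sync, so -A/-X mostly re-syncs ACLs and xattrs
  rsync -aAX /src/ /mnt/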

The machine has 16GB of RAM, of which typically 15G is available for
caching/buffers. At the time of the hang this was still the case; I verified
it with 'free'.
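
For completeness, that check was nothing more than the standard tool:

  free -m    # buffers/cached still accounted for roughly 15G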

Some possibly relevant vm sysctl settings:
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 0
vm.dirty_bytes = 15000000
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
As you can see, vm.dirty_bytes is very low (just 15M), which rules out
kjournald blocking on a very large dirty buffer. (vm.dirty_ratio reads 0
simply because vm.dirty_bytes is set; the two are mutually exclusive.)
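
In case anyone wants to reproduce with the same writeback settings, they can
be inspected and applied with sysctl, e.g.:

  sysctl vm.dirty_bytes                        # inspect the current value
  sysctl -w vm.dirty_bytes=15000000            # also zeroes vm.dirty_ratio
  sysctl -w vm.dirty_background_ratio=10
  sysctl -w vm.dirty_writeback_centisecs=500
  sysctl -w vm.dirty_expire_centisecs=3000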

Hopefully this report will help. Of course, if this is a known issue and
there's a solution available then I'd be eager to hear about it.

If you need any additional details or if I misfiled this bug, let me know.

ii  linux-image-3.2.0-4-amd64          3.2.51-1                      amd64
      Linux 3.2 for 64-bit PCs

