xen qemu on nfs livelock
Dear users,
We have a Ganeti/Xen cluster set up with shared file storage; the NFS
server is a FreeNAS box.
This is the second time now that VMs hosted on the shared file storage
have become blocked:
Sep 29 05:50:49 node-5 kernel: [564977.898152] INFO: task
qemu-system-i38:8138 blocked for more than 120 seconds.
Sep 29 05:50:49 node-5 kernel: [564977.898178] Not tainted
3.16.0-4-amd64 #1
Sep 29 05:50:49 node-5 kernel: [564977.898192] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 29 05:50:49 node-5 kernel: [564977.898212] qemu-system-i38 D
ffff88002339cfb8 0 8138 1 0x00000000
Sep 29 05:50:49 node-5 kernel: [564977.898216] ffff88002339cb60
0000000000000286 0000000000012f00 ffff88001adabfd8
Sep 29 05:50:49 node-5 kernel: [564977.898220] 0000000000012f00
ffff88002339cb60 ffff88002ae61c08 ffff88001adabdf0
Sep 29 05:50:49 node-5 kernel: [564977.898223] ffff88002ae61c0c
ffff88002339cb60 00000000ffffffff ffff88002ae61c10
Sep 29 05:50:49 node-5 kernel: [564977.898226] Call Trace:
Sep 29 05:50:49 node-5 kernel: [564977.898238] [<ffffffff8150e225>] ?
schedule_preempt_disabled+0x25/0x70
Sep 29 05:50:49 node-5 kernel: [564977.898241] [<ffffffff8150fcd3>] ?
__mutex_lock_slowpath+0xd3/0x1c0
Sep 29 05:50:49 node-5 kernel: [564977.898243] [<ffffffff8150fddb>] ?
mutex_lock+0x1b/0x2a
Sep 29 05:50:49 node-5 kernel: [564977.898247] [<ffffffff8113ef0e>] ?
generic_file_write_iter+0x2e/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898256] [<ffffffffa054e9d3>] ?
nfs_file_write+0x83/0x1a0 [nfs]
Sep 29 05:50:49 node-5 kernel: [564977.898260] [<ffffffff811a7ad4>] ?
new_sync_write+0x74/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898263] [<ffffffff811a8212>] ?
vfs_write+0xb2/0x1f0
Sep 29 05:50:49 node-5 kernel: [564977.898267] [<ffffffff810d469e>] ?
SyS_futex+0x6e/0x150
Sep 29 05:50:49 node-5 kernel: [564977.898269] [<ffffffff811a8eb3>] ?
SyS_pwrite64+0x63/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898272] [<ffffffff8151164d>] ?
system_call_fast_compare_end+0x10/0x15
Sep 29 05:50:49 node-5 kernel: [564977.898274] INFO: task
qemu-system-i38:8142 blocked for more than 120 seconds.
Sep 29 05:50:49 node-5 kernel: [564977.898290] Not tainted
3.16.0-4-amd64 #1
Sep 29 05:50:49 node-5 kernel: [564977.898302] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 29 05:50:49 node-5 kernel: [564977.898319] qemu-system-i38 D
ffff880018f69808 0 8142 1 0x00000000
Sep 29 05:50:49 node-5 kernel: [564977.898321] ffff880018f693b0
0000000000000286 0000000000012f00 ffff880006ce7fd8
Sep 29 05:50:49 node-5 kernel: [564977.898323] 0000000000012f00
ffff880018f693b0 ffff88002ae61c08 ffff880006ce7df0
Sep 29 05:50:49 node-5 kernel: [564977.898325] ffff88002ae61c0c
ffff880018f693b0 00000000ffffffff ffff88002ae61c10
Sep 29 05:50:49 node-5 kernel: [564977.898327] Call Trace:
Sep 29 05:50:49 node-5 kernel: [564977.898329] [<ffffffff8150e225>] ?
schedule_preempt_disabled+0x25/0x70
Sep 29 05:50:49 node-5 kernel: [564977.898331] [<ffffffff8150fcd3>] ?
__mutex_lock_slowpath+0xd3/0x1c0
Sep 29 05:50:49 node-5 kernel: [564977.898333] [<ffffffff8150fddb>] ?
mutex_lock+0x1b/0x2a
Sep 29 05:50:49 node-5 kernel: [564977.898335] [<ffffffff8113ef0e>] ?
generic_file_write_iter+0x2e/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898340] [<ffffffffa054e9d3>] ?
nfs_file_write+0x83/0x1a0 [nfs]
Sep 29 05:50:49 node-5 kernel: [564977.898342] [<ffffffff811a7ad4>] ?
new_sync_write+0x74/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898345] [<ffffffff811a8212>] ?
vfs_write+0xb2/0x1f0
Sep 29 05:50:49 node-5 kernel: [564977.898346] [<ffffffff811a8eb3>] ?
SyS_pwrite64+0x63/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898349] [<ffffffff8151164d>] ?
system_call_fast_compare_end+0x10/0x15
Sep 29 05:50:49 node-5 kernel: [564977.898350] INFO: task
qemu-system-i38:8143 blocked for more than 120 seconds.
Sep 29 05:50:49 node-5 kernel: [564977.898367] Not tainted
3.16.0-4-amd64 #1
Sep 29 05:50:49 node-5 kernel: [564977.898378] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 29 05:50:49 node-5 kernel: [564977.898395] qemu-system-i38 D
ffff880024e31908 0 8143 1 0x00000000
Sep 29 05:50:49 node-5 kernel: [564977.898397] ffff880024e314b0
0000000000000286 0000000000012f00 ffff880006c47fd8
Sep 29 05:50:49 node-5 kernel: [564977.898399] 0000000000012f00
ffff880024e314b0 ffff88002ae61c08 ffff880006c47d40
Sep 29 05:50:49 node-5 kernel: [564977.898401] ffff88002ae61c0c
ffff880024e314b0 00000000ffffffff ffff88002ae61c10
Sep 29 05:50:49 node-5 kernel: [564977.898403] Call Trace:
Sep 29 05:50:49 node-5 kernel: [564977.898405] [<ffffffff8150e225>] ?
schedule_preempt_disabled+0x25/0x70
Sep 29 05:50:49 node-5 kernel: [564977.898407] [<ffffffff8150fcd3>] ?
__mutex_lock_slowpath+0xd3/0x1c0
Sep 29 05:50:49 node-5 kernel: [564977.898409] [<ffffffff8150fddb>] ?
mutex_lock+0x1b/0x2a
Sep 29 05:50:49 node-5 kernel: [564977.898411] [<ffffffff8113ef0e>] ?
generic_file_write_iter+0x2e/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898415] [<ffffffffa054e9d3>] ?
nfs_file_write+0x83/0x1a0 [nfs]
Sep 29 05:50:49 node-5 kernel: [564977.898418] [<ffffffff811a7a60>] ?
new_sync_read+0xa0/0xa0
Sep 29 05:50:49 node-5 kernel: [564977.898420] [<ffffffff811a7b5f>] ?
do_iter_readv_writev+0x5f/0x90
...
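For reference, the blocked task names and PIDs can be pulled out of such
hung-task messages with a one-liner; a minimal sketch (two of the lines
above are inlined as sample input — on the node you would pipe in
`dmesg` instead):

```shell
#!/bin/sh
# Sketch: print "<pid> <task name>" for every task the kernel reported
# as blocked. On a live node, replace the printf with: dmesg
printf '%s\n' \
  'INFO: task qemu-system-i38:8138 blocked for more than 120 seconds.' \
  'INFO: task qemu-system-i38:8142 blocked for more than 120 seconds.' \
  'INFO: task qemu-system-i38:8143 blocked for more than 120 seconds.' |
  sed -n 's/.*INFO: task \([^ ]*\):\([0-9]*\) blocked.*/\2 \1/p'
```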
The NFS server was alive: the 6 other nodes in the cluster were working
fine. I could also access the mounted share on the affected machine; I
could list directories and copy files from the share. The strange part
was that copying a file from the share to /tmp was really _slow_, at
around 128 KB/s, although neither the local disks nor the NFS server
were under load. Packet captures on the ethernet interface showed no NFS
traffic, while the destination file in /tmp was growing very slowly.
This suggests to me that the NFS reads themselves were fast and finished
in time, but that somewhere in the kernel the processing of the replies
was blocked or slowed down.
I did not test whether I could write to the NFS share, though.
As the VMs were blocked, I decided to move (failover) the running
instances to another node and keep this node up for later debugging.
I killed one instance with 'xl destroy' and then ran 'gnt-instance
failover' on it, and afterwards I noticed that all the VMs had
recovered; the I/O block had disappeared.
I do not know whether, had I chosen a different VM to kill/move, I
would have seen the same result, i.e. whether the deadlock/livelock
would have disappeared as well.
Unfortunately I cannot reproduce the problem; I am just waiting for it
to happen again so that I can collect more information.
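For the next occurrence, this is roughly what I plan to collect on the
affected node (a sketch; it assumes root and that the magic-sysrq
interface is enabled, and the /tmp file names are just placeholders):

```shell
#!/bin/sh
# Rough collection script for the next hang (assumes root and
# CONFIG_MAGIC_SYSRQ; prints a message and does nothing otherwise).
if [ "$(id -u)" -eq 0 ] && [ -w /proc/sysrq-trigger ]; then
    # 'w' asks the kernel to dump the stacks of all uninterruptible
    # (D-state) tasks into the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200 > /tmp/hung-tasks.txt
    # per-mount NFS op counts, retransmits and round-trip times
    cat /proc/self/mountstats > /tmp/mountstats.txt
    # client-side RPC counters (what nfsstat -c reads)
    cat /proc/net/rpc/nfs > /tmp/rpc-nfs.txt 2>/dev/null || true
else
    echo "needs root (and a writable /proc/sysrq-trigger) on the affected node" >&2
fi
```

Comparing mountstats from a healthy node against the stuck one should
show whether the RPCs are actually completing.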
The node runs a very minimal Debian jessie, with Ganeti and Xen
installed.
Has anyone else experienced such issues?
Thanks in advance,
Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.