Hang on NFS storage machines

To: debian-kernel@lists.debian.org
Subject: Hang on NFS storage machines
From: George Barnett <gbarnett@atlassian.com>
Date: Sat, 16 Mar 2013 16:12:43 +1100
Message-id: <B2EC601CDDA242189A46599B31EA6AD3@atlassian.com>

Hi, 

We use Debian for a number of machines in our storage infrastructure, and we have recently been seeing a number of "hangs". We primarily notice this when nfsd processes lock up and the hung task killer then goes wild. We finally managed to get a trace last night; it is pasted below:

We did not see this crash under the 2.6.39 backport; however, that kernel spontaneously rebooted at ~200 days uptime (about 3/4 of our infrastructure rebooted within a few weeks, which was not a good time for our ops teams).

I would be grateful if anybody who can help narrow this down would jump in, either with requests for further information or with advice.


[11309697.466397] ------------[ cut here ]------------ 
[11309697.466556] WARNING: at /build/buildd-linux_3.2.23-1~bpo60+2-amd64-oLufer/linux-3.2.23/fs/jbd2/journal.c:507 __jbd2_log_start_commit+0x7e/0x8c [jbd2]()
[11309697.466660] Hardware name: X8DT6
[11309697.466728] JBD2: bad log_start_commit: 2205591757 2205591757 14613566 0
[11309697.466798] Modules linked in: netconsole autofs4 8021q garp bridge stp nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding tcp_htcp ext4 jbd2 crc16 configfs loop ohci_hcd tpm_tis tpm i7core_edac i2c_i801 snd_pcm snd_timer snd soundcore edac_core i2c_core ioatdma tpm_bios snd_page_alloc coretemp crc32c_intel psmouse pcspkr joydev evdev acpi_cpufreq mperf processor serio_raw button thermal_sys ext3 jbd mbcache usbhid hid sd_mod ses enclosure crc_t10dif uhci_hcd ahci libahci libata igb ehci_hcd e1000e usbcore dca megaraid_sas usb_common scsi_mod [last unloaded: netconsole]
[11309697.470190] Pid: 62, comm: kswapd0 Not tainted 3.2.0-0.bpo.3-amd64 #1
[11309697.470261] Call Trace:
[11309697.470329] [<ffffffff810498a8>] ? warn_slowpath_common+0x78/0x8c
[11309697.470399] [<ffffffff8104995a>] ? warn_slowpath_fmt+0x45/0x4a
[11309697.470471] [<ffffffffa01cabad>] ? __jbd2_log_start_commit+0x7e/0x8c [jbd2]
[11309697.470558] [<ffffffffa01cac83>] ? jbd2_log_start_commit+0x21/0x2f [jbd2]
[11309697.470634] [<ffffffffa02dee7a>] ? ext4_evict_inode+0x86/0x2d1 [ext4]
[11309697.470707] [<ffffffff81119626>] ? evict+0x9a/0x14e
[11309697.470775] [<ffffffff811198b4>] ? dispose_list+0x35/0x3f
[11309697.470844] [<ffffffff81119b87>] ? prune_icache_sb+0x2c9/0x2d8
[11309697.470915] [<ffffffff811081b0>] ? prune_super+0xd6/0x147
[11309697.470987] [<ffffffff810cb9e2>] ? shrink_slab+0x1a3/0x266
[11309697.471056] [<ffffffff810cd937>] ? balance_pgdat+0x335/0x625
[11309697.471126] [<ffffffff810cdf31>] ? kswapd+0x30a/0x325
[11309697.471196] [<ffffffff81063815>] ? wake_up_bit+0x20/0x20
[11309697.471265] [<ffffffff810cdc27>] ? balance_pgdat+0x625/0x625
[11309697.471334] [<ffffffff810cdc27>] ? balance_pgdat+0x625/0x625
[11309697.471403] [<ffffffff810633d9>] ? kthread+0x7a/0x82
[11309697.471472] [<ffffffff8136d3f4>] ? kernel_thread_helper+0x4/0x10
[11309697.471543] [<ffffffff8106335f>] ? kthread_worker_fn+0x147/0x147
[11309697.471613] [<ffffffff8136d3f0>] ? gs_change+0x13/0x13
[11309697.471680] ---[ end trace 56d2be5ea52d0917 ]---
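
One possible reading of the "bad log_start_commit" line in the trace above, offered as a sketch rather than a diagnosis: jbd2 transaction IDs are 32-bit counters that wrap, and the kernel's tid_geq() helper compares them through a signed 32-bit difference. Assuming the first number printed is the journal's commit request and the third is the requested target tid (an assumption about the format string, not something stated in this report), the two values are more than 2^31 apart, so the comparison treats the request as being "behind" the target. The Python below only mimics that arithmetic; signed32_diff is a hypothetical helper name, and none of this is kernel code.

```python
# Sketch only: mimics a signed 32-bit comparison of wrapping transaction IDs.
# tid_geq mirrors the name of the kernel helper; signed32_diff is a
# hypothetical helper introduced here for illustration.

MASK32 = 0xFFFFFFFF


def signed32_diff(x: int, y: int) -> int:
    """Return (x - y) reinterpreted as a signed 32-bit integer."""
    d = (x - y) & MASK32
    return d - (1 << 32) if d & 0x80000000 else d


def tid_geq(x: int, y: int) -> bool:
    """True if tid x is at or after tid y in wrapping 32-bit order."""
    return signed32_diff(x, y) >= 0


if __name__ == "__main__":
    commit_request = 2205591757  # first number in the warning line (assumed meaning)
    target = 14613566            # third number in the warning line (assumed meaning)
    print(signed32_diff(commit_request, target))
    print(tid_geq(commit_request, target))
```

If that reading is right, a target tid that is this far from the current commit request would be enough to trip the check, which may be worth keeping in mind when looking at where the stale-looking tid comes from.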

-- 
George Barnett

gbarnett@atlassian.com

Follow-Ups:
- jbd2 tid wrap seen on NFS server
  - From: Ben Hutchings <ben@decadent.org.uk>
