--- Begin Message ---
Package: ocfs2-tools
Version: 1.4.1-1
Severity: critical
Justification: causes high load, reboot on both nodes (2-node cluster)
Log details on node1:
--
Jan 19 09:49:33 kernel: [5665461.514795] (6894,14):dlm_drop_lockres_ref:2224 ERROR: while dropping ref on A35DE40B6A044A4A873B96E2F2DE42B2:M000000000000000112401200000000 (master=0) got -22.
Jan 19 09:49:33 kernel: [5665461.805602] lockres: M00000000000000011240120000000, owner=0, state=64
Jan 19 09:49:33 kernel: [5665461.932077] last used: 5332038594, refcnt: 3, on purge list: yes
Jan 19 09:49:33 kernel: [5665462.148475] on dirty list: no, on reco list: no, migrating pending: no
Jan 19 09:49:33 kernel: [5665462.274649] inflight locks: 0, asts reserved: 0
Jan 19 09:49:33 kernel: [5665462.274649] refmap nodes: [ ], inflight=0
Jan 19 09:49:33 kernel: [5665462.274649] granted queue:
Jan 19 09:49:33 kernel: [5665462.274649] converting queue:
Jan 19 09:49:33 kernel: [5665462.274649] blocked queue:
Jan 19 09:49:33 kernel: [5665462.274649] ------------[ cut here ]------------
Jan 19 09:49:33 kernel: [5665462.274649] kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2226!
Jan 19 09:49:33 kernel: [5665462.274649] invalid opcode: 0000 [1] SMP
Jan 19 09:49:33 kernel: [5665462.274649] CPU 14
Jan 19 09:49:33 kernel: [5665462.274649] Modules linked in: nls_utf8 cifs nls_base ip_vs_rr xt_connlimit nfs ocfs2 ip_vs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipt_LOG xt_limit nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables dm_rdac qla2xxx bnx2 firmware_class usbhid uhci_hcd thermal sr_mod snd_pcm snd_timer snd_page_alloc snd soundcore shpchp sg sd_mod scsi_transport_fc scsi_tgt processor pcspkr pci_hotplug megaraid_sas loop ipv6 ide_pci_generic ide_core i2c_i801 i2c_core hid ff_memless fan thermal_sys ext3 jbd mbcache evdev ehci_hcd dm_round_robin dm_multipath dm_mod cdrom cdc_ether usbnet mii button ata_piix ata_generic libata scsi_mod dock
Jan 19 09:49:33 kernel: [5665462.274649] Pid: 6894, comm: dlm_thread Not tainted 2.6.26-2-amd64 #1
Jan 19 09:49:33 kernel: [5665462.274649] RIP: 0010:[<ffffffffa038c381>] [<ffffffffa038c381>] :ocfs2_dlm:dlm_drop_lockres_ref+0x1dd/0x1f0
Jan 19 09:49:33 kernel: [5665462.274649] RSP: 0018:ffff810875ceddd0 EFLAGS: 00010202
Jan 19 09:49:33 kernel: [5665462.274649] RAX: ffff8105364e8888 RBX: 0000000000000000 RCX: 00000000031a9f89
Jan 19 09:49:33 kernel: [5665462.274649] RDX: 0000000000000000 RSI: 0000000000000034 RDI: 0000000000000282
Jan 19 09:49:33 kernel: [5665462.274649] RBP: 000000000000001f R08: 0000000000000000 R09: ffff810875ced900
Jan 19 09:49:33 kernel: [5665462.274649] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8105364e8840
Jan 19 09:49:33 kernel: [5665462.274649] R13: ffff81086ddd7800 R14: ffff81070616bb80 R15: 00000000000000b5
Jan 19 09:49:33 kernel: [5665462.274649] FS: 0000000000000000(0000) GS:ffff81107cf981c0(0000) knlGS:0000000000000000
Jan 19 09:49:33 kernel: [5665462.274649] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 19 09:49:33 kernel: [5665462.274649] CR2: 0000000002694000 CR3: 0000000000201000 CR4: 00000000000006e0
Jan 19 09:49:33 kernel: [5665462.274649] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 19 09:49:33 kernel: [5665462.274649] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 19 09:49:33 kernel: [5665462.274649] Process dlm_thread (pid: 6894, threadinfo ffff810875cec000, task ffff81086e521770)
Jan 19 09:49:33 kernel: [5665462.274649] Stack: 000000000000001f ffff81070616bb80 ffff810500000000 00000000ffffffea
Jan 19 09:49:33 kernel: [5665462.274649] 1f01000000000000 303030303030304d 3030303030303030 3032313034323131
Jan 19 09:49:33 kernel: [5665462.274649] 0030303030303030 0000000000000000 0000000000000000 0000000000000000
Jan 19 09:49:33 kernel: [5665462.274649] Call Trace:
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffffa0381860>] ? :ocfs2_dlm:dlm_thread+0x237/0x1107
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff802461a5>] ? autoremove_wake_function+0x0/0x2e
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffffa0381629>] ? :ocfs2_dlm:dlm_thread+0x0/0x1107
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff8024607f>] ? kthread+0x47/0x74
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff802300ed>] ? schedule_tail+0x27/0x5c
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff8020cf38>] ? child_rip+0xa/0x12
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff8021a866>] ? lapic_next_event+0xf/0x13
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff80246038>] ? kthread+0x0/0x74
Jan 19 09:49:33 kernel: [5665462.274649] [<ffffffff8020cf2e>] ? child_rip+0x0/0x12
Jan 19 09:49:33 kernel: [5665462.274649]
Jan 19 09:49:33 kernel: [5665462.274649]
Jan 19 09:49:33 kernel: [5665462.274649] Code: 8b 14 25 24 00 00 00 48 c7 c1 e0 89 39 a0 89 d2 4c 89 74 24 08 89 44 24 10 31 c0 89 2c 24 e8 2c 90 ea df 4c 89 e7 e8 32 43 ff ff <0f> 0b eb fe 48 83 c4 70 89 d8 5b 5d 41 5c 41 5d 41 5e c3 41 54
Jan 19 09:49:33 kernel: [5665462.274649] RIP [<ffffffffa038c381>] :ocfs2_dlm:dlm_drop_lockres_ref+0x1dd/0x1f0
Jan 19 09:49:33 kernel: [5665462.274649] RSP <ffff810875ceddd0>
Jan 19 09:49:33 kernel: [5665462.422453] ---[ end trace ee1657d875d4e1f1 ]---
--
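The raw "Stack:" words in the trace above can be read back into something recognisable. The following is a minimal Python sketch, assuming (this is an interpretation, not something the log states) that the words following 1f01000000000000 hold the lock name of the deref request in little-endian byte order, and that 00000000ffffffea holds the returned status in its low 32 bits. The helper names are made up for illustration.

```python
import struct

# Stack words copied from the "Stack:" dump above (x86_64 is little-endian).
NAME_WORDS = [
    0x303030303030304D,
    0x3030303030303030,
    0x3032313034323131,
    0x0030303030303030,
]
STATUS_WORD = 0x00000000FFFFFFEA


def words_to_ascii(words):
    """Pack 64-bit words as little-endian bytes and cut at the first NUL."""
    raw = b"".join(struct.pack("<Q", w) for w in words)
    return raw.split(b"\0", 1)[0].decode("ascii")


def low32_signed(word):
    """Interpret the low 32 bits of a stack word as a signed integer."""
    return struct.unpack("<i", struct.pack("<I", word & 0xFFFFFFFF))[0]


lock_name = words_to_ascii(NAME_WORDS)
status = low32_signed(STATUS_WORD)
print(lock_name, len(lock_name), status)
```

Under those assumptions the decoded name is the 31-character lock name from the first ERROR line (31 is 0x1f, the value in RBP), and the status word is the -22 reported in "got -22".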
--
Jan 19 09:54:05 kernel: [5665830.740248] o2net: connection to node XXX (num 0) at x.x.x.x:xxxx has been idle for 30.0 seconds, shutting it down.
Jan 19 09:54:05 kernel: [5665831.043225] (0,12):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1295427215.500604 now 1295427245.497577 dr 1295427215.497446 adv 1295427215.500636:1295427215.500637 func (8737b25e:500) 1295427215.500605:1295427215.500635)
Jan 19 09:54:05 kernel: [5665831.247064] o2net: no longer connected to node XXX (num 0) at x.x.x.x:xxxx
Jan 19 09:54:25 kernel: [5665831.427022] (22635,0):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (6482,12):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.423470] (4102,7):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (21432,1):dlm_send_remote_unlock_request:359 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.423471] (5686,4):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.423470] (7690,8):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (21552,15):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (6810,14):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (6910,9):dlm_drop_lockres_ref:2219 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427378] (7049,11):dlm_drop_lockres_ref:2219 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.651713] (8202,3):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427022] (7005,10):dlm_drop_lockres_ref:2219 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.651713] (7932,2):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.651713] (7770,13):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.423470] (6159,5):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (6482,12):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.423470] (4102,7):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.423471] (5686,4):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.423470] (7690,8):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427378] (21552,15):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427378] (7295,1):dlm_drop_lockres_ref:2219 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.651713] (24522,6):dlm_do_master_request:1342 ERROR: link to 0 went down!
Jan 19 09:54:25 kernel: [5665831.427378] (6810,14):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427432] (6910,9):dlm_purge_lockres:190 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427378] (7049,11):dlm_purge_lockres:190 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.651713] (8202,3):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427022] (7005,10):dlm_purge_lockres:190 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.651713] (7932,2):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.651713] (7770,13):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.423470] (6159,5):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:25 kernel: [5665831.427378] (4172,12):dlm_do_master_request:1342 ERROR: link to 0 went down!
--
Corresponding errors on node2:
--
Jan 19 09:54:36 kernel: [5414618.046127] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046133] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046138] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046165] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046170] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046180] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046184] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046208] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046213] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046255] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046259] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046277] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046281] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046317] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046322] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046363] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046367] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046374] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046379] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046394] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046399] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046405] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.046410] (20385,2):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jan 19 09:54:36 kernel: [5414618.147644] o2net: accepted connection from node XXX (num 1) at x.x.x.x:xxxx
Jan 19 09:54:36 kernel: [5414620.867740] (31712,1):dlm_do_master_request:1342 ERROR: link to 1 went down!
Jan 19 09:54:36 kernel: [5414620.867740] (31712,1):dlm_get_lock_resource:919 ERROR: status = -112
Jan 19 09:54:36 kernel: [5414620.871982] (31124,9):dlm_do_master_request:1342 ERROR: link to 1 went down!
Jan 19 09:54:36 kernel: [5414620.871982] (31124,9):dlm_get_lock_resource:919 ERROR: status = -112
--
Our technical investigation showed that this was not caused by a network problem.
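
For reading the logs above, the negative status values are kernel errno codes. A small Python sketch that names the three codes seen here follows; the table is hardcoded with the Linux values (as found in the kernel's asm-generic errno headers) rather than taken from Python's errno module, because that module reports the numbers of whatever platform the script runs on. The function name is hypothetical.

```python
# Linux errno numbers for the three status values that appear in the logs.
# Hardcoded on purpose: other platforms assign different numbers.
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status):
    """Return a readable label for a negative kernel status code."""
    name, text = LINUX_ERRNO.get(abs(status), ("UNKNOWN", "not in this table"))
    return f"{status}: {name} ({text})"


for s in (-22, -112, -107):
    print(decode_status(s))
```

Read this way, the BUG on node1 follows an EINVAL reply to the deref request, the later node1 errors are EHOSTDOWN after the o2net connection was shut down, and node2 reports ENOTCONN for its unlock requests.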
-- System Information:
Debian Release: 5.0.7
APT prefers stable
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.26-26lenny1
CPU: Intel(R) Xeon(R) CPU E5620
Versions of packages ocfs2-tools depends on:
ii libc6 2.7-18lenny6
ii libcomerr2 1.41.3-1
ii libglib2.0-0 2.16.6-3
ii libncurses5 5.7+20081213-1
ii libreadline5 5.2-3.1
ii libuuid1 1.41.3-1
Versions of packages ocfs2-tools suggests:
ii ocfs2console 1.4.1-1
/etc/default/o2cb values:
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
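
To relate these values to the node1 log, here is a rough Python sketch. It assumes the commonly documented o2cb relation that the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, and it assumes that "tmr" and "now" in the o2net_idle_timer line are the time the idle timer was armed and the time it fired. Both are assumptions, not statements from the log itself.

```python
from decimal import Decimal

# Values from /etc/default/o2cb as listed above.
O2CB = {
    "O2CB_HEARTBEAT_THRESHOLD": 31,
    "O2CB_IDLE_TIMEOUT_MS": 30000,
    "O2CB_KEEPALIVE_DELAY_MS": 2000,
    "O2CB_RECONNECT_DELAY_MS": 2000,
}


def disk_heartbeat_timeout_s(threshold, interval_s=2):
    """Assumed relation: (threshold - 1) heartbeat intervals of 2 seconds."""
    return (threshold - 1) * interval_s


# Timestamps copied from the o2net_idle_timer line in the node1 log.
tmr = Decimal("1295427215.500604")
now = Decimal("1295427245.497577")

idle_gap_s = now - tmr
idle_timeout_s = Decimal(O2CB["O2CB_IDLE_TIMEOUT_MS"]) / 1000
heartbeat_s = disk_heartbeat_timeout_s(O2CB["O2CB_HEARTBEAT_THRESHOLD"])

print(f"disk heartbeat timeout: {heartbeat_s} s")
print(f"network idle timeout:   {idle_timeout_s} s")
print(f"observed idle gap:      {idle_gap_s} s")
```

The observed gap is within a few milliseconds of O2CB_IDLE_TIMEOUT_MS, which matches the "has been idle for 30.0 seconds" message.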
Regards,
Szabolcs JANOSI
--- End Message ---
--- Begin Message ---
- To: 610530-done@bugs.debian.org
- Cc: Szabolcs JANOSI <janosi.szabolcs@allegroup.hu>
- Subject: Re: ocfs2-tools: BUG at fs/ocfs2/dlm/dlmmaster.c:2226! invalid opcode
- From: Moritz Muehlenhoff <jmm@inutil.org>
- Date: Mon, 12 Aug 2013 17:06:52 +0200
- Message-id: <20130812150652.GA10215@inutil.org>
- In-reply-to: <20120209225234.GA3853@burratino>
- References: <4D36EE52.5060700@allegroup.hu> <20120209225234.GA3853@burratino>
On Thu, Feb 09, 2012 at 04:52:34PM -0600, Jonathan Nieder wrote:
> reassign 610530 linux-2.6 linux-2.6/2.6.26-26lenny1
> quit
>
> Hi Szabolcs,
>
> Szabolcs JANOSI wrote:
>
> > Justification: causes high load, reboot on both nodes (2-node cluster)
> >
> > Log details on node1:
> >
> > (6894,14):dlm_drop_lockres_ref:2224 ERROR: while dropping ref on A35DE40B6A044A4A873B96E2F2DE42B2:M000000000000000112401200000000 (master=0) got -22.
> > lockres: M00000000000000011240120000000, owner=0, state=64
> > last used: 5332038594, refcnt: 3, on purge list: yes
> > on dirty list: no, on reco list: no, migrating pending: no
> > inflight locks: 0, asts reserved: 0
> > refmap nodes: [ ], inflight=0
> > granted queue:
> > converting queue:
> > blocked queue:
> > ------------[ cut here ]------------
> > kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2226!
> [...]
> > Code: 8b 14 25 24 00 00 00 48 c7 c1 e0 89 39 a0 89 d2 4c 89 74 24 08 89 44 24 10 31 c0 89 2c 24 e8 2c 90 ea df 4c 89 e7 e8 32 43 ff ff <0f> 0b eb fe 48 83 c4 70 89 d8 5b 5d 41 5c 41 5d 41 5e c3 41 54
> > RIP [<ffffffffa038c381>] :ocfs2_dlm:dlm_drop_lockres_ref+0x1dd/0x1f0
> [...]
> > Technical investigations resulted that it was not caused by network problem.
>
> I guess this was reproducible. Was it a regression? (I.e., do you
> know of any previous kernel that worked ok?)
>
> | $ git show debian/lenny:fs/ocfs2/dlm/dlmmaster.c | sed -n 2220,2226' 'p
> | else if (r < 0) {
> | /* BAD. other node says I did not have a ref. */
> | mlog(ML_ERROR,"while dropping ref on %s:%.*s "
> | "(master=%u) got %d.\n", dlm->name, namelen,
> | lockname, res->owner, r);
> | dlm_print_one_lock_resource(res);
> | BUG();
>
> What kernel do you use these days? Can you still reproduce this?
>
> If you can reproduce this with a current squeeze or sid kernel, the next
> step will be to get in touch from upstream. Sorry we missed this before.
No further feedback, closing the bug.
If the bug can be reproduced with a current kernel (e.g. Wheezy), please
reopen.
Cheers,
Moritz
--- End Message ---