[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#608118: linux-image-2.6-amd64: OCFS2 filesystem fails with Kernel Stacktraces



Package: linux-image-2.6-amd64
Version: 2.6.32+28
Severity: important
Tags: squeeze

I have built a Cluster using Corosync/OpenAIS using a dual-primary 
DRBD Device that is OCFS2 formatted. The Filesystem failed to mount 
throwing various stacktraces into the kernel log (included below).

The problem is difficult to reproduce on its own, however, it seems
to be releated to cases, where one of my cluster node needs to 
recover from a crash (simulated by hitting the reset key).

Case 1: At this point the system usually goes unresponsive and nees 
to be rebooted again (usually with a manual, gradual recovery of the
DRBD).

Dec 26 06:25:30 hohnode1 kernel: [241775.034274] dlm: closing connection to node 675260608
Dec 26 06:25:57 hohnode1 kernel: [241802.518115] VFS: Busy inodes after unmount of drbd1. Self-destruct in 5 seconds.  Have a nice day...
Dec 26 06:26:54 hohnode1 kernel: [241859.151069] VFS: Busy inodes after unmount of drbd0. Self-destruct in 5 seconds.  Have a nice day...
Dec 26 06:27:04 hohnode1 kernel: [241865.586618] dlm: closing connection to node 675260608
Dec 27 10:26:24 hohnode1 kernel: [  201.992153] block drbd0: conn( Unconnected -> WFConnection )
Dec 27 10:26:29 hohnode1 kernel: [  207.266543] block drbd0: role( Secondary -> Primary )
Dec 27 10:26:29 hohnode1 kernel: [  207.267110] block drbd0: Creating new current UUID
Dec 27 10:26:29 hohnode1 kernel: [  207.563813] ocfs2: Mounting device (147,0) on (node 3397162, slot 0) with ordered data mode.
Dec 27 10:27:04 hohnode1 kernel: [  234.401831] block drbd1: conn( StandAlone -> Unconnected )
Dec 27 10:27:04 hohnode1 kernel: [  234.401904] block drbd1: Starting receiver thread (from drbd1_worker [2580])
Dec 27 10:27:04 hohnode1 kernel: [  234.403452] block drbd1: receiver (re)started
Dec 27 10:27:04 hohnode1 kernel: [  234.403458] block drbd1: conn( Unconnected -> WFConnection )
Dec 27 10:27:10 hohnode1 kernel: [  240.687720] INFO: task crm_mon:3354 blocked for more than 120 seconds.
Dec 27 10:27:10 hohnode1 kernel: [  240.689707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 10:27:10 hohnode1 kernel: [  240.690395] crm_mon       D ffff880001815780     0  3354      1 0x00000004
Dec 27 10:27:10 hohnode1 kernel: [  240.690401]  ffff880017dde9f0 0000000000000082 0000000000000366 ffff88001d9a4730
Dec 27 10:27:10 hohnode1 kernel: [  240.690406]  00007fe0cdabb000 ffff880018439c28 000000000000f9e0 ffff880018439fd8
Dec 27 10:27:10 hohnode1 kernel: [  240.690409]  0000000000015780 0000000000015780 ffff88001f0a5bd0 ffff88001f0a5ec8
Dec 27 10:27:10 hohnode1 kernel: [  240.690413] Call Trace:
Dec 27 10:27:10 hohnode1 kernel: [  240.690440]  [<ffffffff810b826f>] ? zone_watermark_ok+0x20/0xb1
Dec 27 10:27:10 hohnode1 kernel: [  240.690448]  [<ffffffff810c73d0>] ? zone_statistics+0x3c/0x5d
Dec 27 10:27:10 hohnode1 kernel: [  240.690474]  [<ffffffffa04ab687>] ? ocfs2_wait_for_recovery+0x9d/0xb7 [ocfs2]
Dec 27 10:27:10 hohnode1 kernel: [  240.690484]  [<ffffffff81064cea>] ? autoremove_wake_function+0x0/0x2e
Dec 27 10:27:10 hohnode1 kernel: [  240.690498]  [<ffffffffa049b8fb>] ? ocfs2_inode_lock_full_nested+0x16b/0xb2c [ocfs2]
Dec 27 10:27:10 hohnode1 kernel: [  240.690509]  [<ffffffffa04a273c>] ? ocfs2_permission+0x6a/0x166 [ocfs2]
Dec 27 10:27:10 hohnode1 kernel: [  240.690518]  [<ffffffff810f6618>] ? __link_path_walk+0x75/0x6f5
Dec 27 10:27:10 hohnode1 kernel: [  240.690522]  [<ffffffff810f6ec6>] ? path_walk+0x66/0xc9
Dec 27 10:27:10 hohnode1 kernel: [  240.690526]  [<ffffffff810f8534>] ? do_filp_open+0x15d/0x94b
Dec 27 10:27:10 hohnode1 kernel: [  240.690530]  [<ffffffff811018ed>] ? alloc_fd+0x67/0x10c
Dec 27 10:27:10 hohnode1 kernel: [  240.690534]  [<ffffffff810ece53>] ? do_sys_open+0x55/0xfc
Dec 27 10:27:10 hohnode1 kernel: [  240.690542]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

Case 2:

This happend when I manually put one node into standby using
the CRM tools triggering a graceful recover. Again, the system 
triggered a kernel stacktrace and after that was unusable:

Dec 26 06:34:20 hohnode1 kernel: [242304.948271] dlm: closing connection to node 675260608
Dec 26 06:34:23 hohnode1 kernel: [242305.057213] ------------[ cut here ]------------
Dec 26 06:34:23 hohnode1 kernel: [242305.059443] kernel BUG at /build/buildd-linux-2.6_2.6.32-29-amd64-xcs37n/linux-2.6-2.6.32/debian/build/source_amd64_none/fs/ocfs2/heartbe
at.c:68!
Dec 26 06:34:23 hohnode1 kernel: [242305.063307] invalid opcode: 0000 [#1] SMP
Dec 26 06:34:23 hohnode1 kernel: [242305.064175] last sysfs file: /sys/kernel/dlm/0C4565479CFF4E49A0B50D1EE16D5A39/control
Dec 26 06:34:23 hohnode1 kernel: [242305.064922] CPU 0
Dec 26 06:34:23 hohnode1 kernel: [242305.065771] Modules linked in: sha1_generic ocfs2 jbd2 ocfs2_nodemanager quota_tree ocfs2_stack_user ocfs2_stackglue dlm configfs vmblock
 vsock vmmemctl vmhgfs pvscsi acpiphp bonding drbd lru_cache cn loop snd_ens1371 gameport snd_rawmidi snd_seq_device snd_ac97_codec ac97_bus snd_pcm snd_timer snd parport_pc
shpchp parport i2c_piix4 soundcore psmouse snd_page_alloc evdev serio_raw pcspkr container vmci ac processor i2c_core pci_hotplug button ext3 jbd mbcache vmxnet sd_mod crc_t1
0dif sg sr_mod cdrom mptspi mptscsih uhci_hcd ata_generic mptbase scsi_transport_spi ata_piix ehci_hcd libata e1000 usbcore nls_base floppy thermal thermal_sys scsi_mod [last
 unloaded: scsi_wait_scan]
Dec 26 06:34:23 hohnode1 kernel: [242305.077615] Pid: 2424, comm: ocfs2_controld. Not tainted 2.6.32-5-amd64 #1 VMware Virtual Platform
Dec 26 06:34:23 hohnode1 kernel: [242305.078126] RIP: 0010:[<ffffffffa049a2bc>]  [<ffffffffa049a2bc>] ocfs2_do_node_down+0x13/0x7d [ocfs2]
Dec 26 06:34:23 hohnode1 kernel: [242305.078656] RSP: 0018:ffff880017adbe38  EFLAGS: 00010246
Dec 26 06:34:23 hohnode1 kernel: [242305.079168] RAX: ffff880017b32000 RBX: ffff88001e450000 RCX: 0000000000000046
Dec 26 06:34:23 hohnode1 kernel: [242305.079672] RDX: 0000000000000000 RSI: ffff88001e450000 RDI: 00000000143fa8c0
Dec 26 06:34:23 hohnode1 kernel: [242305.084363] RBP: 00000000143fa8c0 R08: 00000000143fa8c0 R09: 0000000000000010
Dec 26 06:34:23 hohnode1 kernel: [242305.085492] R10: ffffffff814a6e60 R11: ffffffff81152b79 R12: ffff88001f0fa8c0
Dec 26 06:34:23 hohnode1 kernel: [242305.085992] R13: 0000000000000020 R14: ffff880017adbe9d R15: 000000000000002f
Dec 26 06:34:23 hohnode1 kernel: [242305.086569] FS:  00007f101a03c760(0000) GS:ffff880001800000(0000) knlGS:0000000000000000
Dec 26 06:34:23 hohnode1 kernel: [242305.087083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 26 06:34:23 hohnode1 kernel: [242305.087585] CR2: 00007f1018c6bce7 CR3: 000000001dd48000 CR4: 00000000000006f0
Dec 26 06:34:23 hohnode1 kernel: [242305.088173] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 26 06:34:23 hohnode1 kernel: [242305.088705] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 26 06:34:23 hohnode1 kernel: [242305.089212] Process ocfs2_controld. (pid: 2424, threadinfo ffff880017ada000, task ffff880017ef8000)
Dec 26 06:34:23 hohnode1 kernel: [242305.089711] Stack:
Dec 26 06:34:23 hohnode1 kernel: [242305.095121]  0000000000000068 ffff88001f0fa900 00000000143fa8c0 ffffffffa04059fc
Dec 26 06:34:23 hohnode1 kernel: [242305.095142] <0> 0000000000000000 ffffffff00000000 ffff880017adbe98 0000000000000001
Dec 26 06:34:23 hohnode1 kernel: [242305.096214] <0> 0000000000000000 ffff880017adbec6 0000000000004000 ffff880017adbec8
Dec 26 06:34:23 hohnode1 kernel: [242305.097273] Call Trace:
Dec 26 06:34:23 hohnode1 kernel: [242305.097763]  [<ffffffffa04059fc>] ? ocfs2_control_write+0x4cc/0x525 [ocfs2_stack_user]
Dec 26 06:34:23 hohnode1 kernel: [242305.098285]  [<ffffffff810ef012>] ? vfs_write+0xa9/0x102
Dec 26 06:34:23 hohnode1 kernel: [242305.098776]  [<ffffffff810ef127>] ? sys_write+0x45/0x6e
Dec 26 06:34:23 hohnode1 kernel: [242305.099276]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Dec 26 06:34:23 hohnode1 kernel: [242305.099755] Code: 08 3e 41 0f ab 6c 24 08 66 41 ff 85 a8 00 00 00 5e 5b 5d 41 5c 41 5d c3 55 89 fd 53 48 89 f3 48 83 ec 08 39 be 30 01 00 00 75 04 <0f> 0b eb fe f6 05 2b 6a f8 ff 01 74 41 f6 05 2a 6a f8 ff 01 75
Dec 26 06:34:23 hohnode1 kernel: [242305.105912] RIP  [<ffffffffa049a2bc>] ocfs2_do_node_down+0x13/0x7d [ocfs2]
Dec 26 06:34:23 hohnode1 kernel: [242305.106420]  RSP <ffff880017adbe38>
Dec 26 06:34:23 hohnode1 kernel: [242307.965106] ---[ end trace ddf49d8bad9f95b8 ]---
Dec 26 06:34:35 hohnode1 kernel: [242319.467021] dlm: closing connection to node 675260608
Dec 27 09:57:30 hohnode1 kernel: [340807.948567] block drbd0: Handshake successful: Agreed network protocol version 91
Dec 27 09:57:30 hohnode1 kernel: [340807.949751] block drbd0: conn( WFConnection -> WFReportParams )
Dec 27 09:57:30 hohnode1 kernel: [340807.950258] block drbd0: Starting asender thread (from drbd0_receiver [2856])
Dec 27 09:57:30 hohnode1 kernel: [340807.961386] block drbd0: data-integrity-alg: sha1
Dec 27 09:57:30 hohnode1 kernel: [340807.962901] block drbd0: drbd_sync_handshake:
Dec 27 09:57:30 hohnode1 kernel: [340807.963327] block drbd0: self F70AE1510F667B27:EC93CE03D20D7C23:2AC9FADDCB76125A:94AAAFD676CD3A1C bits:5 flags:0
Dec 27 09:57:30 hohnode1 kernel: [340807.963829] block drbd0: peer B85D09159BAD14C8:EC93CE03D20D7C23:2AC9FADDCB76125A:94AAAFD676CD3A1C bits:3072 flags:2
Dec 27 09:57:30 hohnode1 kernel: [340807.964301] block drbd0: uuid_compare()=100 by rule 90
Dec 27 09:57:30 hohnode1 kernel: [340807.964782] block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node
Dec 27 09:57:30 hohnode1 kernel: [340807.965247] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Dec 27 09:57:30 hohnode1 kernel: [340808.014734] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Dec 27 09:57:30 hohnode1 kernel: [340808.015873] block drbd0: Began resync as SyncSource (will sync 12288 KB [3072 bits set]).
Dec 27 09:57:30 hohnode1 kernel: [340808.631482] block drbd0: Resync done (total 1 sec; paused 0 sec; 12288 K/sec)
Dec 27 09:57:30 hohnode1 kernel: [340808.632698] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Dec 27 09:57:38 hohnode1 kernel: [340816.025882] block drbd0: peer( Secondary -> Primary )
Dec 27 10:03:25 hohnode1 kernel: [341163.100806] BUG: unable to handle kernel paging request at ffff8800283fa8c0
Dec 27 10:03:25 hohnode1 kernel: [341163.102657] IP: [<ffffffff810e6dee>] kmem_cache_alloc+0x61/0xf0
Dec 27 10:03:25 hohnode1 kernel: [341163.114378] PGD 1002063 PUD 1006063 PMD 0
Dec 27 10:03:25 hohnode1 kernel: [341163.114803] Oops: 0000 [#2] SMP

I will attach the full kernel logs for further reference.


-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/1 CPU core)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages linux-image-2.6-amd64 depends on:
ii  linux-image-2.6.32-5-amd64    2.6.32-29  Linux 2.6.32 for 64-bit PCs

linux-image-2.6-amd64 recommends no packages.

linux-image-2.6-amd64 suggests no packages.

-- no debconf information

Attachment: kern.log.bz2
Description: BZip2 compressed data


Reply to: