
Bug#598323: marked as done (DRBD+OCFS2: reproducible BUGs and GPFs on heavy load)



Your message dated Tue, 21 Feb 2012 04:24:43 -0600
with message-id <20120221102443.GA28089@burratino>
and subject line Re: [squeeze] Kernel bug seems to occur on ocfs2+drbd in pri-pri
has caused the Debian Bug report #616726,
regarding DRBD+OCFS2: reproducible BUGs and GPFs on heavy load
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
616726: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616726
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: linux-image-2.6.35.6
Version: 2.6.35.6-10.00.Custom
Severity: important


Hello.

First of all, this is my first bug report to Debian, and I am sorry if I have done something wrong. Just tell me what needs to be fixed in it.

I have two Dell 2950 servers and am trying to use them as an email cluster.
I use DRBD with OCFS2 on top of it. Both nodes reboot under heavy load every time.

I am reporting this bug against the package linux-image-2.6.35.6, but that is not quite accurate: I have the same problem on 2.6.26 (stable) and 2.6.32 (testing). I only tried the latest kernel to be sure.
I tried ocfs2-tools from stable and from testing: the nodes reboot. I tried DRBD8 from backports, then the native DRBD on 2.6.32, and then DRBD-8.3.8 compiled from source with 2.6.35-6: the nodes reboot.
So I think it is kernel related, but I could well be wrong. I am not sure what causes these reboots.

What I do:
1) Create the DRBD metadata on both nodes
drbdadm create-md drbd0

2) Sync it
drbdadm -- --overwrite-data-of-peer primary drbd0
drbdsetup /dev/drbd0 syncer -r 110M

3) Make both nodes primary
drbdadm primary drbd0

4) Create the filesystem
mkfs.ocfs2 -L ocfs2_drbd -N 2 -T mail --fs-feature-level=max-features /dev/drbd0

5) Mount it on both nodes
mount /var/spool/dovecot
(fstab options: nodev,noauto,noatime,data=writeback)

6) Create folders for the test
mkdir /var/spool/dovecot/iozone1
mkdir /var/spool/dovecot/iozone2

7) Start the I/O test on both nodes, each in a different folder
iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls

8) The nodes always reboot after 30-180 minutes, sometimes with a stack trace and a halt, but not every time.
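
For anyone who wants to repeat this, the steps above could be collected into one script per node, roughly as sketched below. This is only an illustration, not the script that was actually used: DRY_RUN, run() and repro_node() are made-up names, mkdir -p replaces the plain mkdir so a re-run does not fail, and the cd before iozone assumes that "in different folders" means the working directory. By default the script only prints the commands. It does not wait for the initial sync to finish or for the peer node, so that ordering still has to be handled by hand.

```shell
#!/bin/sh
# Sketch only: replays the reproduction steps listed above on ONE node.
# drbd0, /var/spool/dovecot and the command lines come from the report;
# DRY_RUN, run() and repro_node() are hypothetical names for illustration.
set -e

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default), 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repro_node DIR [init]
#   DIR  - per-node test folder (iozone1 on one node, iozone2 on the other)
#   init - pass "init" on the node that starts the sync and creates the FS
repro_node() {
    dir="$1"
    role="${2:-peer}"

    run drbdadm create-md drbd0                                  # step 1
    if [ "$role" = "init" ]; then
        run drbdadm -- --overwrite-data-of-peer primary drbd0    # step 2
        run drbdsetup /dev/drbd0 syncer -r 110M
    fi
    run drbdadm primary drbd0                                    # step 3
    if [ "$role" = "init" ]; then
        run mkfs.ocfs2 -L ocfs2_drbd -N 2 -T mail \
            --fs-feature-level=max-features /dev/drbd0           # step 4
    fi
    run mount /var/spool/dovecot                                 # step 5
    run mkdir -p "/var/spool/dovecot/$dir"                       # step 6
    run cd "/var/spool/dovecot/$dir"
    run iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b "/tmp/$(hostname).xls"   # step 7
}

repro_node "${1:-iozone1}" "${2:-peer}"
```

Saved under a hypothetical name such as repro.sh, "sh repro.sh iozone1 init" on the first node and "sh repro.sh iozone2" on the second would print the command list for each node; setting DRY_RUN=0 would actually run them.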

The OCFS2 partition seems to work fine under normal load.

P.S. If I was wrong to report this from a sid-like system, just tell me. This bug is easily reproducible on stable or testing.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.35.6 (SMP w/4 CPU cores)
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages linux-image-2.6.35.6 depends on:
ii  coreutils                     8.5-1      GNU core utilities
ii  debconf [debconf-2.0]         1.5.35     Debian configuration management sy

linux-image-2.6.35.6 recommends no packages.

Versions of packages linux-image-2.6.35.6 suggests:
pn  fdutils                       <none>     (no description available)
pn  ksymoops                      <none>     (no description available)
pn  linux-doc-2.6.35.6 | linux-so <none>     (no description available)
pn  linux-image-2.6.35.6-dbg      <none>     (no description available)

-- debconf information:
  linux-image-2.6.35.6/postinst/old-dir-initrd-link-2.6.35.6: true
  linux-image-2.6.35.6/prerm/removing-running-kernel-2.6.35.6: true
  linux-image-2.6.35.6/preinst/abort-overwrite-2.6.35.6:
  linux-image-2.6.35.6/postinst/old-system-map-link-2.6.35.6: true
  linux-image-2.6.35.6/preinst/already-running-this-2.6.35.6:
  linux-image-2.6.35.6/preinst/overwriting-modules-2.6.35.6: true
  linux-image-2.6.35.6/postinst/depmod-error-initrd-2.6.35.6: false
  linux-image-2.6.35.6/postinst/kimage-is-a-directory:
  linux-image-2.6.35.6/preinst/failed-to-move-modules-2.6.35.6:
  linux-image-2.6.35.6/postinst/depmod-error-2.6.35.6: false
node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = mail01.fxclub.org
        cluster = ocfs2
 
node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = mail02.fxclub.org
        cluster = ocfs2
 
cluster:
        node_count = 2
        name = ocfs2
resource drbd0 {
 
on mail01.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.1:7789;
meta-disk internal;
}
 
on mail02.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.2:7789;
meta-disk internal;
}
 
}
global {
	usage-count yes;
	# minor-count dialog-refresh disable-ip-verification
}
 
common {
	protocol C;
 
	handlers {
                # What should be done in case the node is primary, degraded (=no connection) and has inconsistent data.
                #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /sbin/ifconfig eth1 down";
                # The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.
                #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /sbin/ifconfig eth1 down";
		#local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
		#outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
		# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		#split-brain "/usr/lib/drbd/notify-split-brain.sh root";
		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
	}
 
	startup {
		wfc-timeout 60;
		degr-wfc-timeout 30;
		outdated-wfc-timeout 15;
		become-primary-on both;
		# wait-after-sb;
	}
 
	disk {
		fencing resource-and-stonith;
		# RAID WITH BBU ONLY!!!
		no-disk-flushes;
		no-md-flushes;
		no-disk-barrier;
		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
		# no-disk-drain no-md-flushes max-bio-bvecs   
	}
 
	net {
		cram-hmac-alg sha1;
		shared-secret "password";
		allow-two-primaries;
		ping-timeout 20;
		#after-sb-0pri discard-zero-changes;
		#after-sb-1pri discard-secondary;
		#after-sb-2pri disconnect;	
		data-integrity-alg sha1;
                # Tuning
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 0;
		# snd.buf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
	}
 
	syncer {
		# MegaBYTE! Not Bit.
		rate 40M;
		al-extents 3389;
		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
	}
}
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
  Network idle timeout: 15000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
Stable:
Message from syslogd@mail02 at Sep 16 09:03:19 ...
 kernel:[92182.173794] ------------[ cut here ]------------

Message from syslogd@mail02 at Sep 16 09:03:19 ...
 kernel:[92182.173872] invalid opcode: 0000 [#1] SMP 

Message from syslogd@mail02 at Sep 16 09:03:19 ...
 kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt


Testing:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.310479] ------------[ cut here ]------------

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.310648] invalid opcode: 0000 [#1] SMP 

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.310801] last sysfs file: /sys/fs/o2cb/interface_revision

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.312251] Stack:

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.312251] Call Trace:

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.312251] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c 8b 4f

Testing: 2.6.35 + DRBD 8.3.8
mail01:/usr/local/sbin# mount /var/spool/dovecot

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.451479] ------------[ cut here ]------------

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.451530] invalid opcode: 0000 [#1] SMP 

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.452451] Stack:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.452623] Call Trace:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18 

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.461099] general protection fault: 0000 [#2] SMP 

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
mail01:/usr/local/sbin# 
Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.465065] Stack:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.465065] Call Trace:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1


[55921.451479] ------------[ cut here ]------------
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP 
[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.451584] CPU 1 
[55921.451589] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.451964] 
[55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.452027] RIP: 0010:[<ffffffff810df05d>]  [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.452076] RSP: 0018:ffff88012aa61d58  EFLAGS: 00010246
[55921.452102] RAX: 0200000000000400 RBX: ffff880100000001 RCX: 0000000000000002
[55921.452131] RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000001
[55921.452160] RBP: ffff8800375d8f00 R08: 0000000000000000 R09: 0000000000000000
[55921.452189] R10: ffff88012bce1070 R11: ffff8800375d8f00 R12: ffffffff810f061e
[55921.452219] R13: 0000000018000040 R14: ffff88012c375cf0 R15: ffff88012bce1070
[55921.452248] FS:  00007f7646a967a0(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.452293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55921.452319] CR2: 00007f7646a9c000 CR3: 000000012d245000 CR4: 00000000000006e0
[55921.452349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.452378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.452407] Process udevd (pid: 2995, threadinfo ffff88012aa60000, task ffff880121f4d890)
[55921.452451] Stack:
[55921.452471]  0000000000000000 ffff8800375d8f00 ffff88012bce1070 ffffffff810f061e
[55921.452505] <0> ffff880108000080 000000002bce1070 ffff88012c3759d0 ffff880100000001
[55921.452556] <0> 0000029d0000029d ffff8800375d8fa0 ffff88012f8a4900 ffff8800375d8f00
[55921.452623] Call Trace:
[55921.452647]  [<ffffffff810f061e>] ? vfs_rename+0x3d3/0x3e4
[55921.452674]  [<ffffffff810f1c78>] ? sys_renameat+0x1aa/0x22b
[55921.452702]  [<ffffffff810d13ab>] ? free_pages_and_swap_cache+0x53/0x6e
[55921.452732]  [<ffffffff810c83fb>] ? tlb_finish_mmu+0x2a/0x33
[55921.452759]  [<ffffffff810c8470>] ? remove_vma+0x6c/0x74
[55921.452786]  [<ffffffff810c95d8>] ? do_munmap+0x307/0x329
[55921.452814]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
[55921.453030] RIP  [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.453057]  RSP <ffff88012aa61d58>
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.454368] JBD: Ignoring recovery information on journal
[55921.461099] general protection fault: 0000 [#2] SMP 
[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.461338] CPU 1 
[55921.461385] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.464840] 
[55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G      D     2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.464990] RIP: 0010:[<ffffffff810dffaa>]  [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP: 0018:ffff880103e21ba8  EFLAGS: 00010006
[55921.465065] RAX: 0000000000000000 RBX: 0800000000000000 RCX: ffffffffa0449421
[55921.465065] RDX: 0000000000000000 RSI: ffff88012cfaf000 RDI: 0000000000000004
[55921.465065] RBP: ffffffff81625520 R08: ffff880001a524d0 R09: 0000000000000000
[55921.465065] R10: ffff88012cfaf260 R11: ffff88012ca24420 R12: 000000000000000a
[55921.465065] R13: 00000000000080d0 R14: 00000000000080d0 R15: 0000000000000246
[55921.465065] FS:  00007fee60afe720(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.465065] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55921.465065] CR2: 00007f764630ab8c CR3: 000000012eae3000 CR4: 00000000000006e0
[55921.465065] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.465065] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.465065] Process mount.ocfs2 (pid: 9281, threadinfo ffff880103e20000, task ffff88012ca24420)
[55921.465065] Stack:
[55921.465065]  0000000000000000 ffffffffa0449421 ffff88012cfaf108 ffff88012cfaf000
[55921.465065] <0> ffff88012cfaf000 ffff88012cfaf000 ffff88012aa2e000 ffff88012ca24420
[55921.465065] <0> 0000000000000200 ffffffffa0449421 0000000000000000 ffffffffa044ccec
[55921.465065] Call Trace:
[55921.465065]  [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065]  [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065]  [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1 [ocfs2]
[55921.465065]  [<ffffffffa0473525>] ? ocfs2_fill_super+0x19a2/0x2101 [ocfs2]
[55921.465065]  [<ffffffff8118aa8f>] ? snprintf+0x36/0x3b
[55921.465065]  [<ffffffff810e9f9e>] ? get_sb_bdev+0x137/0x19a
[55921.465065]  [<ffffffffa0471b83>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
[55921.465065]  [<ffffffff810e9675>] ? vfs_kern_mount+0xa6/0x196
[55921.465065]  [<ffffffff810e97c4>] ? do_kern_mount+0x49/0xe7
[55921.465065]  [<ffffffff810fdabb>] ? do_mount+0x75c/0x7d6
[55921.465065]  [<ffffffff810d829a>] ? alloc_pages_current+0x9f/0xc2
[55921.465065]  [<ffffffff810fdbbd>] ? sys_mount+0x88/0xc3
[55921.465065]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1 
[55921.465065] RIP  [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065]  RSP <ffff880103e21ba8>
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
[55941.839304] o2net: accepted connection from node mail02.fxclub.org (num 1) at 192.168.1.2:7777
[55946.003594] o2dlm: Node 1 joins domain E4B99C68B65449068DC403326917DC29
[55946.003673] o2dlm: Nodes in domain E4B99C68B65449068DC403326917DC29: 0 1


Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.645448] general protection fault: 0000 [#3] SMP 

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.649409] Stack:

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.649409] Call Trace:

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1

--- End Message ---
--- Begin Message ---
Version: 2.6.32-41
tags 616726 - unreproducible
quit

Tim Stoop wrote:

> We're currently using the linux-image-2.6.32-5-amd64 package
> (2.6.32-41) and we haven't seen the problem since. So it looks like
> it's solved.

Thanks, both.  Marking accordingly.


--- End Message ---
