
Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition



Package: linux-image-2.6.35.6
Version: 2.6.35.6-10.00.Custom
Severity: important


Hello.

First of all, this is my first bug report to Debian, and I am sorry if I did something wrong. Just tell me what needs to be fixed.

I have two Dell 2950 servers that I am trying to use as an email cluster.
I use OCFS2 on top of DRBD. Both nodes reboot under heavy load every time.

I am reporting this bug against the package linux-image-2.6.35.6, but that is not the whole story: I have the same problem on 2.6.26 (stable) and 2.6.32 (testing). I only tried the latest kernel to be sure.
I tried ocfs2-tools from stable and from testing: the nodes reboot. I tried DRBD8 from backports, then the native DRBD on 2.6.32, and then DRBD 8.3.8 compiled from source against 2.6.35.6: the nodes reboot.
So I think it is kernel related, but I could be wrong. I am not sure what causes these reboots.

What I do:
1) Create the DRBD metadata on both nodes
drbdadm create-md drbd0

2) Sync it
drbdadm -- --overwrite-data-of-peer primary drbd0
drbdsetup /dev/drbd0 syncer -r 110M

3) Make both nodes primary
drbdadm primary drbd0

4) Make FS
mkfs.ocfs2 -L ocfs2_drbd -N 2 -T mail --fs-feature-level=max-features /dev/drbd0

5) Mount it on both nodes
mount /var/spool/dovecot
(fstab options: nodev,noauto,noatime,data=writeback)

6) Create folders for the test
mkdir /var/spool/dovecot/iozone1
mkdir /var/spool/dovecot/iozone2

7) Start the I/O test on both nodes, each in its own folder
iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls

8) I always get a reboot after 30-180 minutes, sometimes with a stack trace and a halt, but not every time.
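
Steps 5-7 can be sketched as one small per-node script. This is only an illustration of the commands listed above: `DRY_RUN`, `run` and `io_test` are names made up for the sketch, `mkdir -p` is used so that a rerun does not fail, and the `cd` into the per-node folder is an assumption, since the steps do not say from which directory iozone was started. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of steps 5-7 for one node.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
# DRY_RUN, run() and io_test() are illustrative names, not part of any tool.
DRY_RUN="${DRY_RUN:-1}"
MNT=/var/spool/dovecot

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 is the per-node test folder: iozone1 on one node, iozone2 on the other.
io_test() {
    dir="$MNT/$1"
    run mount "$MNT"
    run mkdir -p "$dir"
    # Assumption: iozone is started from inside the node's own folder.
    run cd "$dir"
    run iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b "/tmp/$(hostname).xls"
}
```

On the first node this would be called as `io_test iozone1`, on the second as `io_test iozone2`.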

The OCFS2 partition seems to work fine under normal load.

P.S. If I was wrong to report this from a sid-like system, just tell me. This bug is easily reproducible on stable or testing.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.35.6 (SMP w/4 CPU cores)
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages linux-image-2.6.35.6 depends on:
ii  coreutils                     8.5-1      GNU core utilities
ii  debconf [debconf-2.0]         1.5.35     Debian configuration management sy

linux-image-2.6.35.6 recommends no packages.

Versions of packages linux-image-2.6.35.6 suggests:
pn  fdutils                       <none>     (no description available)
pn  ksymoops                      <none>     (no description available)
pn  linux-doc-2.6.35.6 | linux-so <none>     (no description available)
pn  linux-image-2.6.35.6-dbg      <none>     (no description available)

-- debconf information:
  linux-image-2.6.35.6/postinst/old-dir-initrd-link-2.6.35.6: true
  linux-image-2.6.35.6/prerm/removing-running-kernel-2.6.35.6: true
  linux-image-2.6.35.6/preinst/abort-overwrite-2.6.35.6:
  linux-image-2.6.35.6/postinst/old-system-map-link-2.6.35.6: true
  linux-image-2.6.35.6/preinst/already-running-this-2.6.35.6:
  linux-image-2.6.35.6/preinst/overwriting-modules-2.6.35.6: true
  linux-image-2.6.35.6/postinst/depmod-error-initrd-2.6.35.6: false
  linux-image-2.6.35.6/postinst/kimage-is-a-directory:
  linux-image-2.6.35.6/preinst/failed-to-move-modules-2.6.35.6:
  linux-image-2.6.35.6/postinst/depmod-error-2.6.35.6: false
node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = mail01.fxclub.org
        cluster = ocfs2
 
node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = mail02.fxclub.org
        cluster = ocfs2
 
cluster:
        node_count = 2
        name = ocfs2
resource drbd0 {
 
on mail01.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.1:7789;
meta-disk internal;
}
 
on mail02.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.2:7789;
meta-disk internal;
}
 
}
global {
	usage-count yes;
	# minor-count dialog-refresh disable-ip-verification
}
 
common {
	protocol C;
 
	handlers {
                # What should be done in case the node is primary, degraded (=no connection) and has inconsistent data.
                #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /sbin/ifconfig eth1 down";
                # The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.
                #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /sbin/ifconfig eth1 down";
		#local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
		#outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
		# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		#split-brain "/usr/lib/drbd/notify-split-brain.sh root";
		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
	}
 
	startup {
		wfc-timeout 60;
		degr-wfc-timeout 30;
		outdated-wfc-timeout 15;
		become-primary-on both;
		# wait-after-sb;
	}
 
	disk {
		fencing resource-and-stonith;
		# RAID WITH BBU ONLY!!!
		no-disk-flushes;
		no-md-flushes;
		no-disk-barrier;
		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
		# no-disk-drain no-md-flushes max-bio-bvecs   
	}
 
	net {
		cram-hmac-alg sha1;
		shared-secret "password";
		allow-two-primaries;
		ping-timeout 20;
		#after-sb-0pri discard-zero-changes;
		#after-sb-1pri discard-secondary;
		#after-sb-2pri disconnect;	
		data-integrity-alg sha1;
                # Tuning
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 0;
		# snd.buf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
	}
 
	syncer {
		# MegaBYTE! Not bit.
		rate 40M;
		al-extents 3389;
		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
	}
}
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
  Network idle timeout: 15000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
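
A note on reading the status above: the `Heartbeat dead threshold = 31` value is a count of heartbeat iterations, not seconds. Assuming the formula given in the OCFS2 documentation applies to this o2cb version (timeout in seconds = (threshold - 1) * 2), the conversion can be sketched as below. The function name is made up for the example.

```shell
#!/bin/sh
# Convert an O2CB heartbeat dead threshold into seconds.
# Assumption: seconds = (threshold - 1) * 2, as described in the OCFS2 docs.
# hb_threshold_to_seconds is an illustrative name, not an o2cb command.
hb_threshold_to_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

hb_threshold_to_seconds 31   # prints 60
```

So with the threshold shown here, a node would be declared dead after about 60 seconds without a disk heartbeat, if that formula holds.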
Stable:
Message from syslogd@mail02 at Sep 16 09:03:19 ...
 kernel:[92182.173794] ------------[ cut here ]------------

Message from syslogd@mail02 at Sep 16 09:03:19 ...
 kernel:[92182.173872] invalid opcode: 0000 [#1] SMP 

Message from syslogd@mail02 at Sep 16 09:03:19 ...
 kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt


Testing:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.310479] ------------[ cut here ]------------

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.310648] invalid opcode: 0000 [#1] SMP 

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.310801] last sysfs file: /sys/fs/o2cb/interface_revision

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.312251] Stack:

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.312251] Call Trace:

Message from syslogd@mail01 at Sep 16 15:18:37 ...
 kernel:[ 1432.312251] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c 8b 4f

Testing: 2.6.35 + DRBD 8.3.8
mail01:/usr/local/sbin# mount /var/spool/dovecot

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.451479] ------------[ cut here ]------------

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.451530] invalid opcode: 0000 [#1] SMP 

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.452451] Stack:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.452623] Call Trace:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18 

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.461099] general protection fault: 0000 [#2] SMP 

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
mail01:/usr/local/sbin# 
Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.465065] Stack:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.465065] Call Trace:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
 kernel:[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1


[55921.451479] ------------[ cut here ]------------
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP 
[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.451584] CPU 1 
[55921.451589] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.451964] 
[55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.452027] RIP: 0010:[<ffffffff810df05d>]  [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.452076] RSP: 0018:ffff88012aa61d58  EFLAGS: 00010246
[55921.452102] RAX: 0200000000000400 RBX: ffff880100000001 RCX: 0000000000000002
[55921.452131] RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000001
[55921.452160] RBP: ffff8800375d8f00 R08: 0000000000000000 R09: 0000000000000000
[55921.452189] R10: ffff88012bce1070 R11: ffff8800375d8f00 R12: ffffffff810f061e
[55921.452219] R13: 0000000018000040 R14: ffff88012c375cf0 R15: ffff88012bce1070
[55921.452248] FS:  00007f7646a967a0(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.452293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55921.452319] CR2: 00007f7646a9c000 CR3: 000000012d245000 CR4: 00000000000006e0
[55921.452349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.452378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.452407] Process udevd (pid: 2995, threadinfo ffff88012aa60000, task ffff880121f4d890)
[55921.452451] Stack:
[55921.452471]  0000000000000000 ffff8800375d8f00 ffff88012bce1070 ffffffff810f061e
[55921.452505] <0> ffff880108000080 000000002bce1070 ffff88012c3759d0 ffff880100000001
[55921.452556] <0> 0000029d0000029d ffff8800375d8fa0 ffff88012f8a4900 ffff8800375d8f00
[55921.452623] Call Trace:
[55921.452647]  [<ffffffff810f061e>] ? vfs_rename+0x3d3/0x3e4
[55921.452674]  [<ffffffff810f1c78>] ? sys_renameat+0x1aa/0x22b
[55921.452702]  [<ffffffff810d13ab>] ? free_pages_and_swap_cache+0x53/0x6e
[55921.452732]  [<ffffffff810c83fb>] ? tlb_finish_mmu+0x2a/0x33
[55921.452759]  [<ffffffff810c8470>] ? remove_vma+0x6c/0x74
[55921.452786]  [<ffffffff810c95d8>] ? do_munmap+0x307/0x329
[55921.452814]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
[55921.453030] RIP  [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.453057]  RSP <ffff88012aa61d58>
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.454368] JBD: Ignoring recovery information on journal
[55921.461099] general protection fault: 0000 [#2] SMP 
[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.461338] CPU 1 
[55921.461385] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.464840] 
[55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G      D     2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.464990] RIP: 0010:[<ffffffff810dffaa>]  [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP: 0018:ffff880103e21ba8  EFLAGS: 00010006
[55921.465065] RAX: 0000000000000000 RBX: 0800000000000000 RCX: ffffffffa0449421
[55921.465065] RDX: 0000000000000000 RSI: ffff88012cfaf000 RDI: 0000000000000004
[55921.465065] RBP: ffffffff81625520 R08: ffff880001a524d0 R09: 0000000000000000
[55921.465065] R10: ffff88012cfaf260 R11: ffff88012ca24420 R12: 000000000000000a
[55921.465065] R13: 00000000000080d0 R14: 00000000000080d0 R15: 0000000000000246
[55921.465065] FS:  00007fee60afe720(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.465065] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55921.465065] CR2: 00007f764630ab8c CR3: 000000012eae3000 CR4: 00000000000006e0
[55921.465065] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.465065] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.465065] Process mount.ocfs2 (pid: 9281, threadinfo ffff880103e20000, task ffff88012ca24420)
[55921.465065] Stack:
[55921.465065]  0000000000000000 ffffffffa0449421 ffff88012cfaf108 ffff88012cfaf000
[55921.465065] <0> ffff88012cfaf000 ffff88012cfaf000 ffff88012aa2e000 ffff88012ca24420
[55921.465065] <0> 0000000000000200 ffffffffa0449421 0000000000000000 ffffffffa044ccec
[55921.465065] Call Trace:
[55921.465065]  [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065]  [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065]  [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1 [ocfs2]
[55921.465065]  [<ffffffffa0473525>] ? ocfs2_fill_super+0x19a2/0x2101 [ocfs2]
[55921.465065]  [<ffffffff8118aa8f>] ? snprintf+0x36/0x3b
[55921.465065]  [<ffffffff810e9f9e>] ? get_sb_bdev+0x137/0x19a
[55921.465065]  [<ffffffffa0471b83>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
[55921.465065]  [<ffffffff810e9675>] ? vfs_kern_mount+0xa6/0x196
[55921.465065]  [<ffffffff810e97c4>] ? do_kern_mount+0x49/0xe7
[55921.465065]  [<ffffffff810fdabb>] ? do_mount+0x75c/0x7d6
[55921.465065]  [<ffffffff810d829a>] ? alloc_pages_current+0x9f/0xc2
[55921.465065]  [<ffffffff810fdbbd>] ? sys_mount+0x88/0xc3
[55921.465065]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1 
[55921.465065] RIP  [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065]  RSP <ffff880103e21ba8>
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
[55941.839304] o2net: accepted connection from node mail02.fxclub.org (num 1) at 192.168.1.2:7777
[55946.003594] o2dlm: Node 1 joins domain E4B99C68B65449068DC403326917DC29
[55946.003673] o2dlm: Nodes in domain E4B99C68B65449068DC403326917DC29: 0 1


Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.645448] general protection fault: 0000 [#3] SMP 

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.649409] Stack:

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.649409] Call Trace:

Message from syslogd@mail01 at Sep 28 07:27:03 ...
 kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
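
To pull only the headline lines (BUG location, fault type, process and faulting function) out of a saved dmesg dump like the full trace above, something like the following may help. It is a rough sketch: `oops_summary` is a made-up helper name, and the patterns are simply taken from the 2.6.35.6 traces in this report, so other kernels may need different ones.

```shell
#!/bin/sh
# Print the headline lines of each oops found in a saved dmesg dump.
# oops_summary is an illustrative helper; patterns match the traces above.
oops_summary() {
    # Drop the leading "[seconds.micros] " timestamp, then keep the key lines.
    sed -e 's/^\[ *[0-9]*\.[0-9]*] //' "$1" |
        grep -E 'kernel BUG at|invalid opcode:|general protection fault:|^Pid: |^RIP  '
}
```

For example: `dmesg > /tmp/oops.txt; oops_summary /tmp/oops.txt`. It expects plain dmesg output where each line starts with the bracketed timestamp.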
