Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition
Package: linux-image-2.6.35.6
Version: 2.6.35.6-10.00.Custom
Severity: important
Hello.
First of all, this is my first bug report to Debian, so I apologize if I have done something wrong; just tell me what needs to be fixed.
I have two Dell 2950 servers and am trying to use them as an email cluster.
I use DRBD with OCFS2 on top of it. Both nodes reboot under heavy load every time.
I am reporting this bug against the linux-image-2.6.35.6 package, but that is not quite accurate: I have the same problem on 2.6.26 (stable) and 2.6.32 (testing). I only tried the latest kernel to be sure.
I tried ocfs2-tools from stable and from testing; the nodes reboot. I tried DRBD8 from backports, then the native DRBD module on 2.6.32, and then DRBD 8.3.8 compiled from source against 2.6.35.6; the nodes reboot.
So I think this is kernel related, but I could be wrong. I am not sure what causes these reboots.
What I do:
1) Create the DRBD metadata on both nodes
drbdadm create-md drbd0
2) Sync it
drbdadm -- --overwrite-data-of-peer primary drbd0
drbdsetup /dev/drbd0 syncer -r 110M
3) Make both nodes primary
drbdadm primary drbd0
4) Create the file system
mkfs.ocfs2 -L ocfs2_drbd -N 2 -T mail --fs-feature-level=max-features /dev/drbd0
5) Mount it on both nodes
mount /var/spool/dovecot
(fstab options - nodev,noauto,noatime,data=writeback)
6) Create directories for the test
mkdir /var/spool/dovecot/iozone1
mkdir /var/spool/dovecot/iozone2
7) Start the I/O test on both nodes, each in a different directory
iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls
8) The nodes always reboot after 30-180 minutes. Sometimes there is a stack trace and a halt, but not every time.
The OCFS2 partition seems to work fine under normal load.
P.S. If it was wrong to report this from a sid-like system, just tell me. This bug is easily reproducible on stable or testing.
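
A possible sanity check between steps 3 and 5: confirm that DRBD reports a connected, dual-primary, fully synced device before mounting OCFS2 on both nodes. The sketch below assumes the DRBD 8.3 layout of /proc/drbd (a single line containing "cs:... ro:.../... ds:.../..."); the function name drbd_ready is made up for illustration and is not part of any DRBD tool.

```sh
#!/bin/sh
# Sketch: succeed only if the DRBD status text shows a connected,
# dual-primary, fully synced device.
# Assumption: DRBD 8.3 /proc/drbd format, with the three fields on one line.
# The optional first argument is a status file; it defaults to /proc/drbd.
drbd_ready() {
    status_file="${1:-/proc/drbd}"
    grep -q 'cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate' "$status_file"
}
```

It could be used as "drbd_ready && mount /var/spool/dovecot", so that the mount is skipped while the initial sync from step 2 is still running.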
-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.35.6 (SMP w/4 CPU cores)
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages linux-image-2.6.35.6 depends on:
ii coreutils 8.5-1 GNU core utilities
ii debconf [debconf-2.0] 1.5.35 Debian configuration management sy
linux-image-2.6.35.6 recommends no packages.
Versions of packages linux-image-2.6.35.6 suggests:
pn fdutils <none> (no description available)
pn ksymoops <none> (no description available)
pn linux-doc-2.6.35.6 | linux-so <none> (no description available)
pn linux-image-2.6.35.6-dbg <none> (no description available)
-- debconf information:
linux-image-2.6.35.6/postinst/old-dir-initrd-link-2.6.35.6: true
linux-image-2.6.35.6/prerm/removing-running-kernel-2.6.35.6: true
linux-image-2.6.35.6/preinst/abort-overwrite-2.6.35.6:
linux-image-2.6.35.6/postinst/old-system-map-link-2.6.35.6: true
linux-image-2.6.35.6/preinst/already-running-this-2.6.35.6:
linux-image-2.6.35.6/preinst/overwriting-modules-2.6.35.6: true
linux-image-2.6.35.6/postinst/depmod-error-initrd-2.6.35.6: false
linux-image-2.6.35.6/postinst/kimage-is-a-directory:
linux-image-2.6.35.6/preinst/failed-to-move-modules-2.6.35.6:
linux-image-2.6.35.6/postinst/depmod-error-2.6.35.6: false
node:
    ip_port = 7777
    ip_address = 192.168.1.1
    number = 0
    name = mail01.fxclub.org
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.1.2
    number = 1
    name = mail02.fxclub.org
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
resource drbd0 {
    on mail01.fxclub.org {
        device /dev/drbd0;
        disk /dev/sda9;
        address 192.168.1.1:7789;
        meta-disk internal;
    }
    on mail02.fxclub.org {
        device /dev/drbd0;
        disk /dev/sda9;
        address 192.168.1.2:7789;
        meta-disk internal;
    }
}
global {
    usage-count yes;
    # minor-count dialog-refresh disable-ip-verification
}
common {
    protocol C;
    handlers {
        # What should be done in case the node is primary, degraded (=no connection) and has inconsistent data.
        #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /sbin/ifconfig eth1 down";
        # The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.
        #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /sbin/ifconfig eth1 down";
        #local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        #outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        #split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }
    startup {
        wfc-timeout 60;
        degr-wfc-timeout 30;
        outdated-wfc-timeout 15;
        become-primary-on both;
        # wait-after-sb;
    }
    disk {
        fencing resource-and-stonith;
        # RAID WITH BBU ONLY!!!
        no-disk-flushes;
        no-md-flushes;
        no-disk-barrier;
        # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
        # no-disk-drain no-md-flushes max-bio-bvecs
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "password";
        allow-two-primaries;
        ping-timeout 20;
        #after-sb-0pri discard-zero-changes;
        #after-sb-1pri discard-secondary;
        #after-sb-2pri disconnect;
        data-integrity-alg sha1;
        # Tuning
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 0;
        # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
        # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
        # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
    }
    syncer {
        # Megabytes, not megabits.
        rate 40M;
        al-extents 3389;
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
    }
}
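
The syncer comment above stresses that the rate is given in bytes, not bits. As a worked conversion, and assuming the "M" suffix means megabytes per second as that comment says, the configured 40M corresponds to 320 Mbit/s and the 110M set at runtime in step 2 corresponds to 880 Mbit/s, which would be close to line rate if the replication link is gigabit Ethernet (an assumption; the link speed is not stated in this report). The helper name below is hypothetical.

```sh
#!/bin/sh
# Sketch: convert a resync rate from megabytes per second to megabits
# per second (1 byte = 8 bits). Pure arithmetic, no DRBD tools involved.
mbyte_to_mbit() {
    echo $(( $1 * 8 ))
}
```

For example, "mbyte_to_mbit 40" prints 320 and "mbyte_to_mbit 110" prints 880.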
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
Network idle timeout: 15000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
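
For reference, the "Heartbeat dead threshold = 31" shown above is counted in heartbeat iterations, not seconds. Assuming the usual O2CB relation, timeout = (threshold - 1) * 2 seconds, this is about 60 seconds before a node that cannot write its disk heartbeat fences itself. Whether this plays any role in the reboots reported here is not established; the helper name below is made up for illustration.

```sh
#!/bin/sh
# Sketch: convert an O2CB heartbeat dead threshold (counted in 2-second
# heartbeat iterations) to the approximate timeout in seconds.
# Assumption: timeout = (threshold - 1) * 2, as described in the OCFS2
# documentation for O2CB_HEARTBEAT_THRESHOLD.
o2cb_dead_seconds() {
    echo $(( ($1 - 1) * 2 ))
}
```

With the value from the status output, "o2cb_dead_seconds 31" prints 60.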
Stable:
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173794] ------------[ cut here ]------------
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173872] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt
Testing:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310479] ------------[ cut here ]------------
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310648] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310801] last sysfs file: /sys/fs/o2cb/interface_revision
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Stack:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Call Trace:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c 8b 4f
Testing: 2.6.35 + DRBD 8.3.8
mail01:/usr/local/sbin# mount /var/spool/dovecot
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451479] ------------[ cut here ]------------
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451530] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452451] Stack:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452623] Call Trace:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461099] general protection fault: 0000 [#2] SMP
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
mail01:/usr/local/sbin#
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Stack:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Call Trace:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
[55921.451479] ------------[ cut here ]------------
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP
[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.451584] CPU 1
[55921.451589] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.451964]
[55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.452027] RIP: 0010:[<ffffffff810df05d>] [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.452076] RSP: 0018:ffff88012aa61d58 EFLAGS: 00010246
[55921.452102] RAX: 0200000000000400 RBX: ffff880100000001 RCX: 0000000000000002
[55921.452131] RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000001
[55921.452160] RBP: ffff8800375d8f00 R08: 0000000000000000 R09: 0000000000000000
[55921.452189] R10: ffff88012bce1070 R11: ffff8800375d8f00 R12: ffffffff810f061e
[55921.452219] R13: 0000000018000040 R14: ffff88012c375cf0 R15: ffff88012bce1070
[55921.452248] FS: 00007f7646a967a0(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.452293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55921.452319] CR2: 00007f7646a9c000 CR3: 000000012d245000 CR4: 00000000000006e0
[55921.452349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.452378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.452407] Process udevd (pid: 2995, threadinfo ffff88012aa60000, task ffff880121f4d890)
[55921.452451] Stack:
[55921.452471] 0000000000000000 ffff8800375d8f00 ffff88012bce1070 ffffffff810f061e
[55921.452505] <0> ffff880108000080 000000002bce1070 ffff88012c3759d0 ffff880100000001
[55921.452556] <0> 0000029d0000029d ffff8800375d8fa0 ffff88012f8a4900 ffff8800375d8f00
[55921.452623] Call Trace:
[55921.452647] [<ffffffff810f061e>] ? vfs_rename+0x3d3/0x3e4
[55921.452674] [<ffffffff810f1c78>] ? sys_renameat+0x1aa/0x22b
[55921.452702] [<ffffffff810d13ab>] ? free_pages_and_swap_cache+0x53/0x6e
[55921.452732] [<ffffffff810c83fb>] ? tlb_finish_mmu+0x2a/0x33
[55921.452759] [<ffffffff810c8470>] ? remove_vma+0x6c/0x74
[55921.452786] [<ffffffff810c95d8>] ? do_munmap+0x307/0x329
[55921.452814] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
[55921.453030] RIP [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.453057] RSP <ffff88012aa61d58>
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.454368] JBD: Ignoring recovery information on journal
[55921.461099] general protection fault: 0000 [#2] SMP
[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.461338] CPU 1
[55921.461385] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.464840]
[55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G D 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.464990] RIP: 0010:[<ffffffff810dffaa>] [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP: 0018:ffff880103e21ba8 EFLAGS: 00010006
[55921.465065] RAX: 0000000000000000 RBX: 0800000000000000 RCX: ffffffffa0449421
[55921.465065] RDX: 0000000000000000 RSI: ffff88012cfaf000 RDI: 0000000000000004
[55921.465065] RBP: ffffffff81625520 R08: ffff880001a524d0 R09: 0000000000000000
[55921.465065] R10: ffff88012cfaf260 R11: ffff88012ca24420 R12: 000000000000000a
[55921.465065] R13: 00000000000080d0 R14: 00000000000080d0 R15: 0000000000000246
[55921.465065] FS: 00007fee60afe720(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.465065] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55921.465065] CR2: 00007f764630ab8c CR3: 000000012eae3000 CR4: 00000000000006e0
[55921.465065] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.465065] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.465065] Process mount.ocfs2 (pid: 9281, threadinfo ffff880103e20000, task ffff88012ca24420)
[55921.465065] Stack:
[55921.465065] 0000000000000000 ffffffffa0449421 ffff88012cfaf108 ffff88012cfaf000
[55921.465065] <0> ffff88012cfaf000 ffff88012cfaf000 ffff88012aa2e000 ffff88012ca24420
[55921.465065] <0> 0000000000000200 ffffffffa0449421 0000000000000000 ffffffffa044ccec
[55921.465065] Call Trace:
[55921.465065] [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065] [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065] [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1 [ocfs2]
[55921.465065] [<ffffffffa0473525>] ? ocfs2_fill_super+0x19a2/0x2101 [ocfs2]
[55921.465065] [<ffffffff8118aa8f>] ? snprintf+0x36/0x3b
[55921.465065] [<ffffffff810e9f9e>] ? get_sb_bdev+0x137/0x19a
[55921.465065] [<ffffffffa0471b83>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
[55921.465065] [<ffffffff810e9675>] ? vfs_kern_mount+0xa6/0x196
[55921.465065] [<ffffffff810e97c4>] ? do_kern_mount+0x49/0xe7
[55921.465065] [<ffffffff810fdabb>] ? do_mount+0x75c/0x7d6
[55921.465065] [<ffffffff810d829a>] ? alloc_pages_current+0x9f/0xc2
[55921.465065] [<ffffffff810fdbbd>] ? sys_mount+0x88/0xc3
[55921.465065] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
[55921.465065] RIP [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP <ffff880103e21ba8>
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
[55941.839304] o2net: accepted connection from node mail02.fxclub.org (num 1) at 192.168.1.2:7777
[55946.003594] o2dlm: Node 1 joins domain E4B99C68B65449068DC403326917DC29
[55946.003673] o2dlm: Nodes in domain E4B99C68B65449068DC403326917DC29: 0 1
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645448] general protection fault: 0000 [#3] SMP
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Stack:
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Call Trace:
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
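
To pull the key lines out of a long capture like the ones above, a small filter may help. This is only a sketch: it keeps the lines that name the failure ("kernel BUG at", "invalid opcode", "general protection fault") and the final "RIP [<...>]" summary line of each oops, and it assumes the log format shown in this report. The function name oops_summary is made up for illustration.

```sh
#!/bin/sh
# Sketch: print only the lines that identify each oops in a kernel log.
# Reads the files given as arguments, or standard input if there are none.
# The "RIP +\[" alternative matches the closing "RIP [<addr>] symbol" line
# but not the earlier "RIP: 0010:[<addr>]" register dump line.
oops_summary() {
    grep -E 'kernel BUG at|invalid opcode|general protection fault|RIP +\[' "$@"
}
```

For example, "dmesg | oops_summary" would reduce the first oops in the 2.6.35.6 log to its "kernel BUG at mm/slub.c:2834!", "invalid opcode" and "RIP [<ffffffff810df05d>] kfree+0x5b/0xc8" lines.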