
Bug#584881:



We seem to be hitting a similar problem on two different machines. When the monthly checkarray script runs, the check stalls on the RAID partition that holds the LVM volumes.

Both machines have three disks, running RAID10 across partitions. As /proc/mdstat shows below, the check speed has dropped to 0K/sec. smartctl reports no errors for any of the drives. The machines are running squeeze with all packages up to date. /dev/md2 holds an LVM physical volume that is used for Xen disks. Load average is above 100. The machines run ganeti, with xen-pvm and xen-hvm as the hypervisors, and some of the logical volumes are mirrored between the machines with drbd. lvdisplay, vgdisplay and pvdisplay all hang when run. Most of the VMs also show very high load (via ganglia and SNMP reporting), but most are not accessible via ssh or xm console.

Is there any other information I can provide to help debug this?
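
If it would help, we can also capture the md sysfs state of the stuck array while it is in this condition. As far as I know these attributes should be present on a 2.6.32 kernel:

  cat /sys/block/md2/md/sync_action
  cat /sys/block/md2/md/sync_completed
  cat /sys/block/md2/md/sync_speed
  cat /sys/block/md2/md/array_state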

# cat /proc/mdstat
Personalities : [raid10]
md2 : active raid10 sda3[0] sdc3[2] sdb3[1]
      1448908608 blocks super 1.2 64K chunks 2 near-copies [3/3] [UUU]
      [====>................]  check = 23.1% (335167872/1448908608) finish=156694820.0min speed=0K/sec

md0 : active raid10 sda1[0] sdc1[2] sdb1[1]
      14644736 blocks super 1.2 512K chunks 2 near-copies [3/3] [UUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
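
If it is useful as a data point, we could also try aborting the stalled check via sysfs; as I understand it, writing "idle" to sync_action should stop it:

  echo idle > /sys/block/md2/md/sync_action

We could try that next if it would not destroy state that is useful for debugging.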

# cat /proc/`pidof mdadm`/status
Name:   mdadm
State:  S (sleeping)
Tgid:   3298
Pid:    3298
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
VmPeak:    12832 kB
VmSize:    12768 kB
VmLck:         0 kB
VmHWM:       768 kB
VmRSS:       604 kB
VmData:      364 kB
VmStk:        88 kB
VmExe:       316 kB
VmLib:      1692 kB
VmPTE:        48 kB
Threads:        1
SigQ:   9/7244
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000002
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   1
Cpus_allowed_list:      0
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        5739
nonvoluntary_ctxt_switches:     12

[5391047.833632] md: data-check of RAID array md0
[5391047.833636] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[5391047.833639] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[5391047.833644] md: using 128k window, over a total of 14644736 blocks.
[5391048.277677] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
[5391235.279026] md: md0: data-check done.
[5391235.496633] md: data-check of RAID array md2
[5391235.496638] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[5391235.496641] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[5391235.496647] md: using 128k window, over a total of 1448908608 blocks.
[5410976.055527] INFO: task kdmflush:1035 blocked for more than 120 seconds.
[5410976.055566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5410976.055619] kdmflush      D ffff880002818e08     0  1035      2 0x00000000
[5410976.055625]  ffff88002e1a5530 0000000000000246 0000000000000002 0000000000000010
[5410976.055633]  0000000000000000 ffff880002d81e80 000000000000f9e0 ffff88003e44dfd8
[5410976.055640]  0000000000015780 0000000000015780 ffff88003ec59530 ffff88003ec59828
[5410976.055647] Call Trace:
[5410976.055657]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
[5410976.055665]  [<ffffffff811804bb>] ? generic_unplug_device+0x0/0x34
[5410976.055680]  [<ffffffffa020e6f0>] ? wait_barrier+0x9a/0xd7 [raid10]
[5410976.055685]  [<ffffffff8104b430>] ? default_wake_function+0x0/0x9
[5410976.055691]  [<ffffffff81040e42>] ? check_preempt_wakeup+0x0/0x268
[5410976.055698]  [<ffffffffa0210fa2>] ? make_request+0x16f/0x5cd [raid10]
[5410976.055703]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
[5410976.055709]  [<ffffffff810e81c5>] ? kmem_cache_alloc+0x8c/0xf0
[5410976.055717]  [<ffffffffa01f5b9a>] ? md_make_request+0xb6/0xf1 [md_mod]
[5410976.055723]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
[5410976.055728]  [<ffffffff8117f6b7>] ? generic_make_request+0x299/0x2f9
[5410976.055737]  [<ffffffffa021a308>] ? clone_bio+0x44/0xce [dm_mod]
[5410976.055745]  [<ffffffffa021b5e9>] ? __split_and_process_bio+0x2ac/0x56b [dm_mod]
[5410976.055753]  [<ffffffffa021ba38>] ? dm_wq_work+0x137/0x167 [dm_mod]
[5410976.055760]  [<ffffffff810628d3>] ? worker_thread+0x188/0x21d
[5410976.055768]  [<ffffffffa021b901>] ? dm_wq_work+0x0/0x167 [dm_mod]
[5410976.055773]  [<ffffffff81065f06>] ? autoremove_wake_function+0x0/0x2e
[5410976.055778]  [<ffffffff8106274b>] ? worker_thread+0x0/0x21d
[5410976.055783]  [<ffffffff81065c39>] ? kthread+0x79/0x81
[5410976.055788]  [<ffffffff81012baa>] ? child_rip+0xa/0x20
[5410976.055793]  [<ffffffff81011d61>] ? int_ret_from_sys_call+0x7/0x1b
[5410976.055798]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
[5410976.055803]  [<ffffffff81012ba0>] ? child_rip+0x0/0x20
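
If further backtraces would be useful, we can also dump the stacks of all blocked tasks via magic SysRq (assuming sysrq is enabled, i.e. /proc/sys/kernel/sysrq is non-zero):

  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200

and likewise the stack of the md2_resync kernel thread via /proc/<pid>/stack, if that is available on this kernel:

  cat /proc/$(pgrep md2_resync)/stack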

# free
             total       used       free     shared    buffers     cached
Mem:       1045340    1026512      18828          0      61688     401856
-/+ buffers/cache:     562968     482372
Swap:      3161640      29072    3132568

# mount
/dev/md0 on / type ext4 (rw,errors=remount-ro)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
xenfs on /proc/xen type xenfs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)

# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000600d9

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        1216     9764864   fd  Linux raid autodetect
/dev/sda2            1216        1347     1053889+  82  Linux swap / Solaris
/dev/sda3            1348      121601   965940255   fd  Linux raid autodetect

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0007482c

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        1216     9764864   fd  Linux raid autodetect
/dev/sdc2            1216        1347     1053889+  82  Linux swap / Solaris
/dev/sdc3            1348      121601   965940255   fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0006adc2

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        1216     9764864   fd  Linux raid autodetect
/dev/sdb2            1216        1347     1053889+  82  Linux swap / Solaris
/dev/sdb3            1348      121601   965940255   fd  Linux raid autodetect

# uname -a
Linux barwon 2.6.32-5-xen-amd64 #1 SMP Thu May 19 01:16:47 UTC 2011 x86_64 GNU/Linux

-- 
Marcus Furlong - VPAC Systems Administrator
http://www.vpac.org
+61 3 9925 4574


