Bug#675969: [squeeze] kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!

To: Jose Calhariz <jose.calhariz@tagus.ist.utl.pt>
Cc: 675969@bugs.debian.org
Subject: Bug#675969: [squeeze] kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
From: Jonathan Nieder <jrnieder@gmail.com>
Date: Mon, 4 Jun 2012 22:55:29 -0500
Message-id: <20120605035529.GA3118@burratino>
Reply-to: Jonathan Nieder <jrnieder@gmail.com>, 675969@bugs.debian.org
In-reply-to: <20120604160111.27574.49662.reportbug@afs04.ist.utl.pt>
References: <20120604160111.27574.49662.reportbug@afs04.ist.utl.pt>

Hi José,

Jose Calhariz wrote:

> Version: 2.6.32-45
[...]
> The previous night during the periodic mdadm RAID check: 
>  - the linux kernel gave a kernel BUG, 
>  - tried to kick out a failed disk and 
>  - stopped accepting I/O to the affected raid.  
>
> The affected programs were in state D.  The only way to recover was to
> do a reboot.  After reboot the problematic disk was replaced.
>
> This machine has 2 x RAID6 with 6 disks each, for a total of 12 disks.

Thanks for reporting.  Do you have a test system you can experiment
on or ideas for reproducing it in a VM?

[...]
> build/source_i386_none/drivers/md/raid5.c:2764!
> invalid opcode: 0000 [#1] SMP 
> last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:02:01.0/cciss0/c0d0/block/cciss!c0d0/removable
> Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 openafs(P) lp parport_pc parport joydev st sd_mod crc_t10dif ext2 loop tun xt_multiport xfs exportfs 8021q garp stp ip6table_filter ip6_tables iptable_filter ip_tables x_tables ide_generic ide_gd_mod ide_cd_mod ide_core snd_pcm snd_timer hpilo snd soundcore snd_page_alloc hpwdt e752x_edac shpchp rng_core i6300esb edac_core pci_hotplug pcspkr container processor evdev button psmouse serio_raw ext3 jbd mbcache dm_mod raid456 md_mod async_raid6_recov async_pq usbhid hid raid6_pq async_xor xor async_memcpy async_tx sg sr_mod cdrom ata_generic thermal uhci_hcd cciss tg3 floppy ata_piix ehci_hcd libata e1000 usbcore libphy scsi_mod nls_base thermal_sys [last unloaded: openafs]
>
> Pid: 743, comm: md2_raid6 Tainted: P           (2.6.32-5-686 #1) ProLiant DL360 G4
> EIP: 0060:[<f818c811>] EFLAGS: 00010297 CPU: 3
> EIP is at handle_stripe+0x89d/0x173e [raid456]
> EAX: 00000005 EBX: 00000002 ECX: 00000003 EDX: 00000001
> ESI: f6394000 EDI: 00000003 EBP: f6394028 ESP: f58d5e6c
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process md2_raid6 (pid: 743, ti=f58d4000 task=f6569980 task.ti=f58d4000)
> Stack:
>  e6fde3e6 c2988138 00000006 f61c8e00 00000006 0002d995 00020003 00000000
> <0> c2988138 f4cbc86c f65699ac 000f0e67 00000000 f639431c 00000005 fffffffc
> <0> f4cbc86c c1025461 00000000 00000000 00000002 00000005 00988100 c127a45c
> Call Trace:
>  [<c1025461>] ? check_preempt_wakeup+0x196/0x202
>  [<f818d9fb>] ? raid5d+0x349/0x389 [raid456]
>  [<c103b623>] ? del_timer_sync+0xa/0x14
>  [<c103b6cb>] ? process_timeout+0x0/0x5
>  [<f816206e>] ? md_thread+0xe1/0xf8 [md_mod]
>  [<c104433a>] ? autoremove_wake_function+0x0/0x2d
>  [<f8161f8d>] ? md_thread+0x0/0xf8 [md_mod]
>  [<c1044108>] ? kthread+0x61/0x66
>  [<c10440a7>] ? kthread+0x0/0x66
>  [<c1003d47>] ? kernel_thread_helper+0x7/0x10
> Code: e9 9b 01 00 00 83 7c 24 7c 02 74 04 0f 0b eb fe f6 46 28 10 c7 46 3c 00 00 00 00 0f 85 7f 01 00 00 8b 44 24 38 39 44 24 70 7d 04 <0f> 0b eb fe 83 7c 24 7c 02 75 20 6b 84 24 a8 00 00 00 78 ff 44 
> EIP: [<f818c811>] handle_stripe+0x89d/0x173e [raid456] SS:ESP 0068:f58d5e6c

If I am reading correctly, this is

	case check_state_compute_result:
		sh->check_state = check_state_idle;

		/* check that a write has not made the stripe insync */
		if (test_bit(STRIPE_INSYNC, &sh->state))
			break;

		/* now write out any block on a failed drive,
		 * or P or Q if they were recomputed
		 */
		BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

from the call chain

  handle_stripe -> handle_stripe6 -> handle_parity_checks6.
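
To make the failing condition concrete, here is a minimal standalone
sketch of the comparison inside that BUG_ON().  It is not the kernel
code: the struct name stripe_counts and the helper
parity_check_would_bug() are hypothetical, and only the uptodate
counter from the excerpt is modeled.  The 6-disk figure comes from the
report; the EAX value of 5 in the register dump looks consistent with
disks - 1 for such an array, but that reading is an assumption.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-stripe state used in the excerpt;
 * only the counter that the assertion reads is modeled. */
struct stripe_counts {
	int uptodate;	/* blocks in the stripe that hold valid data */
};

/* Returns 1 when the condition in the quoted BUG_ON() would be true,
 * i.e. fewer than disks - 1 blocks are up to date (Q is not needed). */
static int parity_check_would_bug(const struct stripe_counts *s, int disks)
{
	assert(disks >= 4);	/* md RAID6 needs at least 4 member devices */
	return s->uptodate < disks - 1;
}
```

With disks = 6, as on the reporter's arrays, the helper returns 1 for
an uptodate count of 4 or less and 0 for 5 or 6.  So the BUG means the
parity check reached this state with at least two blocks of the stripe
not up to date.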

I had hoped that v3.2-rc5~4^2~8 (md/raid5: abort any pending parity
operations when array fails, 2011-11-08) or some related change would
fix this, but alas, that patch is already in 2.6.32-40, so it cannot
be the missing piece.  I don't have many ideas yet.  Please attach the
full kernel log from the boot in which the BUG above occurred,
starting from boot-up.

Hope that helps,
Jonathan
