[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#717681: linux-image-3.10-1-amd64: reproducable Data loss with kernel linux-image-3.10-1-amd64 with md-raid devices



On Wed, 2013-07-24 at 13:34 +1000, NeilBrown wrote:
> On Wed, 24 Jul 2013 04:02:15 +0100 Ben Hutchings <ben@decadent.org.uk> wrote:
> 
> > Neil, does the report below sound like the bug you fixed with:
> > 
> > commit 7bb23c4934059c64cbee2e41d5d24ce122285176
> > Author: NeilBrown <neilb@suse.de>
> > Date:   Tue Jul 16 16:50:47 2013 +1000
> > 
> >     md/raid10: fix two problems with RAID10 resync.
> > 
> > or any of the others you've recently fixed?
> 
> Yes.  The bug below sounds exactly like the one fixed by the commit above.
> 
> NeilBrown

Thanks very much.  Thomas, please test the attached patch, following the
instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
From: NeilBrown <neilb@suse.de>
Date: Tue, 16 Jul 2013 16:50:47 +1000
Subject: md/raid10: fix two problems with RAID10 resync.
Origin: https://git.kernel.org/linus/7bb23c4934059c64cbee2e41d5d24ce122285176

1/ When an different between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request write.
   Commit 8be185f2c9d54d6 "raid10: Use bio_reset()"
   removed a loop which set BIO_UPTDATE on all appropriate bios.
   So we need to re-add that flag.

These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid10.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2075,11 +2075,17 @@ static void sync_request_write(struct md
 			 * both 'first' and 'i', so we just compare them.
 			 * All vec entries are PAGE_SIZE;
 			 */
-			for (j = 0; j < vcnt; j++)
+			int sectors = r10_bio->sectors;
+			for (j = 0; j < vcnt; j++) {
+				int len = PAGE_SIZE;
+				if (sectors < (len / 512))
+					len = sectors * 512;
 				if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
 					   page_address(tbio->bi_io_vec[j].bv_page),
-					   fbio->bi_io_vec[j].bv_len))
+					   len))
 					break;
+				sectors -= len/512;
+			}
 			if (j == vcnt)
 				continue;
 			atomic64_add(r10_bio->sectors, &mddev->resync_mismatches);
@@ -3386,6 +3392,7 @@ static sector_t sync_request(struct mdde
 
 		if (bio->bi_end_io == end_sync_read) {
 			md_sync_acct(bio->bi_bdev, nr_sectors);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
 			generic_make_request(bio);
 		}
 	}

Attachment: signature.asc
Description: This is a digitally signed message part


Reply to: