Bug#774583: linux-image-3.16.0-0.bpo.4-amd64: Sporadic RAID1 degradation during /usr/share/mdadm/checkarray cron job
Package: linux-image-3.16.0-0.bpo.4-amd64
Version: 3.16.7-ckt2-1~bpo70+1
Severity: important
Dear Maintainer,
* What led up to the situation?
One of my RAID1 arrays sporadically degrades during the checkarray cron job:
Jan 4 00:57:01 nihlus /USR/SBIN/CRON[4367]: (root) CMD (if [ -x /usr/share/mdadm/checkarray ] && [ $(date +%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi)
Jan 4 00:57:01 nihlus kernel: [ 3932.435274] md: data-check of RAID array md0
Jan 4 00:57:01 nihlus kernel: [ 3932.455356] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jan 4 00:57:01 nihlus kernel: [ 3932.469160] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Jan 4 00:57:01 nihlus kernel: [ 3932.524839] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jan 4 00:57:01 nihlus kernel: [ 3932.569568] md: using 128k window, over a total of 262132k.
Jan 4 00:57:03 nihlus kernel: [ 3934.473794] md: md0: data-check done.
Jan 4 00:57:03 nihlus kernel: [ 3934.491622] md: data-check of RAID array md2
Jan 4 00:57:03 nihlus kernel: [ 3934.510850] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jan 4 00:57:03 nihlus mdadm[2289]: RebuildFinished event detected on md device /dev/md/0
Jan 4 00:57:03 nihlus kernel: [ 3934.541334] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jan 4 00:57:03 nihlus kernel: [ 3934.587243] md: using 128k window, over a total of 1952201680k.
[...]
Jan 4 03:35:35 nihlus kernel: [13446.203438] sd 1:0:0:0: [sdb] Unhandled error code
Jan 4 03:35:35 nihlus kernel: [13446.225179] sd 1:0:0:0: [sdb]
Jan 4 03:35:35 nihlus kernel: [13446.239316] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jan 4 03:35:35 nihlus kernel: [13446.265222] sd 1:0:0:0: [sdb] CDB:
Jan 4 03:35:36 nihlus kernel: [13446.280916] Write(10): 2a 00 00 d8 67 08 00 00 20 00
Jan 4 03:35:36 nihlus kernel: [13446.303438] end_request: I/O error, dev sdb, sector 14182152
Jan 4 03:35:36 nihlus kernel: [13446.330133] md/raid1:md2: Disk failure on sdb3, disabling device.
Jan 4 03:35:36 nihlus kernel: [13446.330133] md/raid1:md2: Operation continuing on 1 devices.
Jan 4 03:35:36 nihlus kernel: [13446.401456] md: md2: data-check interrupted.
Jan 4 03:35:36 nihlus kernel: [13446.467913] RAID1 conf printout:
Jan 4 03:35:36 nihlus kernel: [13446.467920] --- wd:1 rd:2
Jan 4 03:35:36 nihlus kernel: [13446.467925] disk 0, wo:0, o:1, dev:sda3
Jan 4 03:35:36 nihlus kernel: [13446.467929] disk 1, wo:1, o:0, dev:sdb3
Jan 4 03:35:36 nihlus kernel: [13446.492871] RAID1 conf printout:
Jan 4 03:35:36 nihlus kernel: [13446.492878] --- wd:1 rd:2
Jan 4 03:35:36 nihlus kernel: [13446.492883] disk 0, wo:0, o:1, dev:sda3
Jan 4 03:35:36 nihlus mdadm[2289]: Fail event detected on md device /dev/md/2
Jan 4 03:35:36 nihlus postfix/pickup[4968]: 3kFPGJ1mCzz1n: uid=0 from=<root>
Jan 4 03:35:36 nihlus postfix/cleanup[5060]: 3kFPGJ1mCzz1n: message-id=<3kFPGJ1mCzz1n@spectre.leuxner.net>
Jan 4 03:35:36 nihlus mdadm[2289]: FailSpare event detected on md device /dev/md/2, component device /dev/sdb3
Jan 4 03:35:36 nihlus mdadm[2289]: RebuildFinished event detected on md device /dev/md/2
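As a cross-check on the log above, the failed command can be decoded by hand. This is a minimal sketch, assuming a POSIX shell and the standard WRITE(10) layout (byte 0 is the opcode, bytes 2-5 the big-endian LBA, bytes 7-8 the transfer length in logical blocks). The function name decode_write10 is made up for illustration.

```shell
# Decode a SCSI WRITE(10) CDB as printed by the kernel.
# decode_write10 is a hypothetical helper, not part of any tool.
decode_write10() {
    # $1..$10 are the ten CDB bytes in hex, in the order shown in the log
    lba=$(( 0x$3$4$5$6 ))   # bytes 2-5: logical block address
    len=$(( 0x$8$9 ))       # bytes 7-8: number of logical blocks
    echo "lba=$lba blocks=$len"
}

decode_write10 2a 00 00 d8 67 08 00 00 20 00
```

The decoded LBA is 14182152, which matches the sector in the end_request line, so both messages refer to the same write. That sector number is relative to the whole disk (sdb), not to the sdb3 partition.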
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2](F)
      1952201680 blocks super 1.2 [2/1] [U_]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      1048564 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      262132 blocks super 1.2 [2/2] [UU]
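In the output above, [2/1] [U_] means the array expects two members but only one is active. The following sketch, assuming awk and the usual /proc/mdstat layout, lists arrays whose status map contains an underscore. The function name degraded_arrays is made up for illustration, and the sample input is the output from this report.

```shell
# List md arrays whose member map (e.g. [U_]) shows a missing device.
# degraded_arrays is a hypothetical helper; it reads mdstat text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* : /            { name = $1 }     # remember the current array
        / blocks / && $NF ~ /_/  { print name }    # "_" marks a missing member
    '
}

degraded_arrays <<'EOF'
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[2](F)
1952201680 blocks super 1.2 [2/1] [U_]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
1048564 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
262132 blocks super 1.2 [2/2] [UU]
EOF
```

On a live system the same function could be fed with "degraded_arrays < /proc/mdstat".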
# smartctl -i /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.16.0-0.bpo.4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA DT01ACA200
Serial Number: [redacted]
LU WWN Device Id: 5 000039 ff3e05ac0
Firmware Version: MX4OABB0
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun Jan 4 18:50:18 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
# smartctl -l selftest /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.16.0-0.bpo.4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1490         -
# 2  Extended offline    Completed without error       00%       822         -
# 3  Short offline       Completed without error       00%       812         -
* What exactly did you do (or not do) that was effective (or
ineffective)?
# mdadm --manage /dev/md2 --remove /dev/sdb3
# mdadm --manage /dev/md2 --add /dev/sdb3
* What was the outcome of this action?
The array rebuilt without _any_ errors. The drive has never gone offline during normal operation,
and its SMART self-tests report no errors. It is only removed from the array sporadically, during
the checkarray job, when a driver timeout occurs.
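Since the failure coincides with a driver timeout under check load, two settings may be worth inspecting: the per-disk SCSI command timeout and the md resync speed cap (the 200000 KB/sec figure in the log). This is a diagnostic sketch only and was not tried in this report. Neither setting is known to be the cause. The function name show_limits is made up for illustration; the paths are the standard Linux sysfs and procfs locations.

```shell
# Print the SCSI command timeout (seconds) of a disk and the global md
# resync speed cap (KB/s). show_limits is a hypothetical helper.
show_limits() {
    disk=$1   # e.g. sdb
    for f in "/sys/block/$disk/device/timeout" /proc/sys/dev/raid/speed_limit_max; do
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$f" "$(cat "$f")"
        else
            printf '%s not readable\n' "$f"
        fi
    done
}

show_limits sdb
```

Both files are writable by root, so a test run with a longer timeout or a lower speed cap would show whether the timeouts depend on the check load.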
* What outcome did you expect instead?
No array degradation during the checkarray cron job.
-- System Information:
Debian Release: 7.7
APT prefers stable
APT policy: (1001, 'stable'), (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 3.16.0-0.bpo.4-amd64
Locale: LANG=en_US.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash