Software RAID blocks
Hi!
TL;DR:
My /home (dm-crypt on top of software RAID5) blocks at irregular intervals,
usually without any error messages.
I can get it going again with "fdisk -l /dev/sdx".
Do you have any ideas how I can debug this issue further? Is it a dm-crypt,
a software RAID (md) or a hardware issue?
---------------------------------------------------------------
Long version:
My /home "partition" is dm-crypt on top of a software RAID5 with 5 SATA disks.
See System info further down in this mail.
Once in a while, user programs freeze because dm-crypt, or something
further down the chain, blocks during what appears to be a write on /home.
If I am lucky and still have a root shell open, I can run
$ fdisk -l /dev/sdx
on one of the hard disks in the RAID, and the hang clears instantly.
I checked whether it could be a spin-down (power management) problem, but
all disks that have a PM feature have it disabled, so I don't think this is
the cause.
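Besides the configured settings, the current power state of each drive can be checked directly. The following is a minimal sketch under assumptions: it filters the output of `hdparm -C` (which reports the drive power mode) and prints every disk that is not reported as active/idle. The helper name `sleeping_disks` is made up for illustration and is not part of hdparm.

```shell
# Read `hdparm -C` output on stdin and print "<device> <state>" for every
# drive whose reported state is anything other than active/idle.
# sleeping_disks is a hypothetical helper name, not part of hdparm.
sleeping_disks() {
    awk '
        /^\/dev\//        { dev = $1; sub(/:$/, "", dev) }
        /drive state is:/ && $NF != "active/idle" { print dev, $NF }
    '
}

# Intended use (needs root), run just before trying the fdisk workaround:
#   hdparm -C /dev/sd[bcdef] | sleeping_disks
```

If this prints a disk during a hang, power management would be worth a second look; empty output would support the conclusion that spin-down is not involved.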
Last night I got a "blocked for more than 300 seconds." message in syslog;
see <https://paste.debian.net/1060134/> (link valid for 90 days).
Log summary:
Jan 13 02:34:44 osprey kernel: [969696.242745] INFO: task md127_raid5:238 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.242772] Call Trace:
Jan 13 02:34:44 osprey kernel: [969696.242789] ? __schedule+0x2a2/0x870
Jan 13 02:34:44 osprey kernel: [969696.242995] INFO: task dmcrypt_write:904 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.243223] INFO: task jbd2/dm-2-8:917 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.243525] INFO: task mpc:6622 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.243997] INFO: task kworker/u8:0:6625 blocked for more than 300 seconds.
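To get the same kind of summary out of a longer kernel log, the hung-task lines can be filtered mechanically. This is a minimal sketch, assuming the log lines have the format shown above; `blocked_tasks` is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print "<task name> <pid>" for every
# hung-task report of the form:
#   INFO: task <name>:<pid> blocked for more than <N> seconds.
# The task name may itself contain colons (e.g. kworker/u8:0), so the
# pid is taken as the last colon-separated number before " blocked".
blocked_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds.*/\1 \2/p'
}

# Intended use:
#   dmesg | blocked_tasks
#   journalctl -k | blocked_tasks
```

The order of the output follows the order of the reports in the log, which can help when reading the chain of blocked tasks.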
In this case I ran
$ fdisk -l /dev/sdf
and everything worked again.
As I understand the log, mpc (a user program) started and probably accessed
its config file on /home. ext4 then tried to save the new access time, which
went down the chain jbd2 -> dm-crypt and finally blocked in md127_raid5.
So the problem is most likely in the software RAID or the hard disks,
isn't it? SMART is activated on all disks and does not show any errors.
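One possible way to narrow down the layer is to look at the per-device in-flight request counters in sysfs while the hang is ongoing. The sketch below is an assumption about how this could be done, not a verified diagnosis procedure: it reads `/sys/block/<dev>/inflight` (two numbers, reads and writes currently in flight) for each given device. `inflight_report` is a hypothetical helper name, and the sysfs root is passed as a parameter only so the function can be tried against a test directory.

```shell
# Usage: inflight_report SYSFS_ROOT DEV...
# Print "<dev> reads=<r> writes=<w>" for each device, taken from
# SYSFS_ROOT/block/<dev>/inflight, or "<dev> inflight=unavailable"
# if that file cannot be read.
inflight_report() {
    sysfs=$1
    shift
    for dev in "$@"; do
        f="$sysfs/block/$dev/inflight"
        if [ -r "$f" ]; then
            # The file holds two whitespace-separated counters.
            read -r reads writes < "$f"
            echo "$dev reads=$reads writes=$writes"
        else
            echo "$dev inflight=unavailable"
        fi
    done
}

# Intended use during a hang, before running fdisk:
#   inflight_report /sys sdb sdc sdd sde sdf md127 dm-2
```

Requests stuck on a single member disk might point towards that disk or its controller, while idle member disks with pending I/O on md127 might point towards the md layer. If SysRq is enabled, `echo w > /proc/sysrq-trigger` additionally dumps all blocked tasks to the kernel log without waiting for the 300 second timeout.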
How can I debug this further to solve the problem? Thanks in advance for
your suggestions.
Tom
---------------------------------------------------------------
System info:
============
Debian testing
$ uname -a
Linux osprey 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux
$ lsblk -i
NAME              MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                 8:0    0 74.5G  0 disk
|-sda1              8:1    0    4G  0 part
| `-cswap1        253:1    0    4G  0 crypt [SWAP]
`-sda2              8:2    0 70.5G  0 part
  `-osprey_root   253:0    0 70.5G  0 crypt /
sdb                 8:16   0  2.7T  0 disk
`-sdb1              8:17   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sdc                 8:32   0  2.7T  0 disk
`-sdc1              8:33   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sdd                 8:48   0  2.7T  0 disk
`-sdd1              8:49   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sde                 8:64   0  2.7T  0 disk
`-sde1              8:65   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sdf                 8:80   0  2.7T  0 disk
`-sdf1              8:81   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
$ sdparm --get STANDBY /dev/sd[bcdef]
/dev/sdb: ATA ST3000VN000-1H41 SC43
STANDBY not found in Power condition [po] mode page
/dev/sdc: ATA WDC WD30EURX-63T 0A80
STANDBY not found in Power condition [po] mode page
/dev/sdd: ATA TOSHIBA DT01ACA3 ABB0
STANDBY not found in Power condition [po] mode page
/dev/sde: ATA ST3000DM001-1CH1 CC27
STANDBY not found in Power condition [po] mode page
/dev/sdf: ATA WDC WD30EFRX-68E 0A80
STANDBY not found in Power condition [po] mode page
$ hdparm -B /dev/sd[bcdef]
/dev/sdb:
APM_level = 254
/dev/sdc:
APM_level = not supported
/dev/sdd:
APM_level = off
/dev/sde:
APM_level = 254
/dev/sdf:
APM_level = not supported
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md127 : active raid5 sdc1[1] sdd1[2] sdb1[0] sdf1[5] sde1[3]
11719766016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 1/22 pages [4KB], 65536KB chunk
unused devices: <none>
$ for i in {b..f}; do echo "DISK: ${i}"; smartctl -a "/dev/sd${i}" |grep "SMART overall-health self-assessment test result"; done
DISK: b
SMART overall-health self-assessment test result: PASSED
DISK: c
SMART overall-health self-assessment test result: PASSED
DISK: d
SMART overall-health self-assessment test result: PASSED
DISK: e
SMART overall-health self-assessment test result: PASSED
DISK: f
SMART overall-health self-assessment test result: PASSED
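The overall-health result is a coarse pass/fail summary, so individual attributes may still be worth a look. The following is a minimal sketch, assuming the usual ten-column attribute table printed by `smartctl -A`: it prints a few attributes commonly associated with media or cabling trouble when their raw value is greater than zero. The choice of attributes and the helper name `smart_suspects` are assumptions made for illustration.

```shell
# Read `smartctl -A` output on stdin and print "<attribute> <raw value>"
# for selected attributes whose raw value (column 10) is above zero.
# smart_suspects is a hypothetical helper name, not part of smartmontools.
smart_suspects() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
            print $2, $10
        }
    '
}

# Intended use (needs root):
#   for i in b c d e f; do echo "DISK: $i"; smartctl -A "/dev/sd$i" | smart_suspects; done
```

Empty output for a disk only means that none of the selected counters is above zero; it does not rule out a controller or cable problem.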