
Re: Bug with soft raid?



Hello Andy,

Thank you very much for your lengthy and very informative answer.

After some investigation, I discovered that it was /dev/sdc that had
some problems, so I took it out of the RAID 1 array. But this didn't
really help, since I got other freezes.

grep "120 seconds" kern.log
Feb 18 16:16:38 box kernel: [30209.474017] INFO: task md1_raid1:467 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474151] INFO: task md0_raid1:470 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474250] INFO: task jbd2/md0-8:982 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474447] INFO: task jbd2/md1-8:988 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474721] INFO: task configmgrWriter:26206 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474944] INFO: task kworker/u56:1:25006 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.475150] INFO: task kworker/u56:2:26207 blocked for more than 120 seconds.
Feb 18 16:18:39 box kernel: [30330.307956] INFO: task md1_raid1:467 blocked for more than 120 seconds.
Feb 18 16:18:39 box kernel: [30330.308088] INFO: task md0_raid1:470 blocked for more than 120 seconds.
Feb 18 16:18:39 box kernel: [30330.308188] INFO: task jbd2/md0-8:982 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.751926] INFO: task md0_raid1:412 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752059] INFO: task md1_raid1:416 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752158] INFO: task jbd2/md1-8:988 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752348] INFO: task jbd2/md0-8:993 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752513] INFO: task uptimed:1174 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752743] INFO: task fetchmail:3121 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752990] INFO: task offlineimap:4247 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.753195] INFO: task kworker/u56:0:10116 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.753390] INFO: task kworker/u56:2:11869 blocked for more than 120 seconds.
Feb 19 11:05:22 box kernel: [ 8338.585502] INFO: task md0_raid1:412 blocked for more than 120 seconds.

On Fri, Feb 15, 2019 at 09:35:27AM +0100, steve wrote:
>for i in /dev/sd{b..f}; do echo "DISK: ${i}"; smartctl -l scterc "${i}"; sleep 3; done

I get this for sdb and sdc

SCT Error Recovery Control:
          Read: Disabled
         Write: Disabled

and this for sdf

SCT Error Recovery Control:
          Read:     70 (7.0 seconds)
         Write:     70 (7.0 seconds)

What does it tell me?

It means that sd[bc] may support SCTERC but it's disabled (promising),
and sdf does support it and it's set to 7 seconds (good).
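
To check several disks without reading each report by eye, the three kinds
of output seen in this thread can be told apart with a small shell helper.
This is only a sketch: the function name scterc_state is made up for
illustration, and the patterns are based solely on the smartctl 6.6 output
quoted in this message, so other versions may print something different.

```shell
#!/bin/sh
# Classify the text printed by `smartctl -l scterc <dev>`, read from stdin.
# Prints "disabled", "enabled" or "unknown".
# scterc_state is a hypothetical helper name, not part of smartmontools.
scterc_state() {
    report=$(cat)
    if printf '%s\n' "$report" | grep -Eq '(Read|Write): +Disabled'; then
        # Drive answered the query, but error recovery control is off.
        echo disabled
    elif printf '%s\n' "$report" | grep -Eq '(Read|Write): +[0-9]+ \('; then
        # Drive reported a numeric limit, e.g. "70 (7.0 seconds)".
        echo enabled
    else
        # Query failed or the output has an unexpected shape.
        echo unknown
    fi
}
```

It would be used as `smartctl -l scterc /dev/sdb | scterc_state`. An
"unknown" result, like the "Unexpected SCT status" reply further down,
is best treated the same as an unsupported drive.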

For disks in Linux software RAID, SCTERC with a low timeout is
essential. If it's not possible then the block layer timeout for the
device should be increased.

You should try to set SCTERC for sd[bc] like so:

# for dev in /dev/sd[bc]; do smartctl -l scterc,70,70 "$dev"; done

I tried this:

smartctl -l scterc,70,70 /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control set to:
          Read:     70 (7.0 seconds)
         Write:     70 (7.0 seconds)

But then

smartctl -l scterc /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Unexpected SCT status 0x0046 (action_code=3, function_code=2)
SCT (Get) Error Recovery Control command failed


Which is weird…


If that works then great - all your drives support SCTERC and have low
timeouts.

If setting it to 70 (centiseconds, so 7 seconds) doesn't work then you
will need to increase the block layer timeout like this:

cat /sys/block/sdb/device/timeout
30


echo 180 > /sys/block/sdb/device/timeout
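
For more than one disk, the read-then-write steps above could be wrapped in
a small function. This is a minimal sketch, not a tested tool: the name
set_block_timeout and the SYSFS_ROOT override are hypothetical, added only
so the function can be tried against a scratch directory before touching
the real /sys (which needs root).

```shell
#!/bin/sh
# Raise the block layer command timeout for one disk and report the change.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
set_block_timeout() {
    dev=$1          # short device name, e.g. sdb
    seconds=$2      # new timeout in seconds, e.g. 180
    file="${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    old=$(cat "$file")
    echo "$seconds" > "$file"
    # Read the value back so the report shows what is actually in effect.
    echo "$dev: $old -> $(cat "$file")"
}
```

Example: `for d in sdb sdc; do set_block_timeout "$d" 180; done`. Values
written to /sys do not survive a reboot, so the call has to be repeated at
boot time (for instance from a startup script) if it turns out to help.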

Let's see if it helps.


This is a field I don't master at all, so I am just following your advice.


Will let you know.

Thank you

Best,
Steve

