
Re: RAID-1 and disk I/O



On 7/17/21 08:34, Urs Thuermann wrote:
> Here, the noticeable lines are IMHO
>
>     Raw_Read_Error_Rate     (208245592 vs. 117642848)
>     Command_Timeout         (8 14 17 vs. 0 0 0)
>     UDMA_CRC_Error_Count    (11058 vs. 29)
>
> Do these numbers indicate a serious problem with my /dev/sda drive?
> And is it a disk problem or a transmission problem?
> UDMA_CRC_Error_Count sounds like a cable problem to me, right?
>
> BTW, for a year or so I had problems with /dev/sda every couple of
> months, where the kernel set the drive status in the RAID array to
> failed.  I could always fix the problem by hot-plugging out the
> drive, wiggling the SATA cable, re-inserting and re-adding the drive
> (without any impact on the running server).  Now, I haven't seen the
> problem for quite a while.  My suspicion is that the cable is still
> not working very well, but failures are not frequent enough to set
> the drive to "failed" status.
>
> urs

I switched from Seagate to WD Red years ago because I couldn't get the Seagates to last more than a year or so.  I have one WD that is 6.87 years old with no errors, well past the 5-year life expectancy.  In recent years WD stirred up a marketing controversy with their Red drives (quietly shipping SMR models under the plain Red label).  See:

https://arstechnica.com/gadgets/2020/06/western-digital-adds-red-plus-branding-for-non-smr-hard-drives/

So be careful to get the Pro version if you decide to try WD.  I use WD4003FFBX (4 TB) drives in RAID 1 and have them at 2.8 years running 24/7 with no problems.

If you value your data, get another drive NOW: your drives are already 5 and 5.8 years old!  Add it to the array, let it settle in (sync), and see what happens; I hope your existing array can hold together long enough to add a third drive.  Given all the errors reported, I would have replaced those drives long ago.  You might also want to get new cables, since you have had cable trouble in the past.
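If you go that route, growing the mirror usually looks something like
this sketch (the array and device names are just examples; adjust for
your setup):

    # Partition the new disk to match the existing members, then add
    # it and grow the RAID 1 from 2 to 3 active devices.
    mdadm --manage /dev/md0 --add /dev/sdc1
    mdadm --grow /dev/md0 --raid-devices=3

    # Watch the resync progress until the new member is fully in sync.
    cat /proc/mdstat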

I also run SMART self-tests weekly to make sure the drives are OK, and a full smartctl -a dump daily.  On top of that I run BackupPC on a separate server to keep backups of important data.
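For the self-tests and the daily dump, I mean something along these
lines (the device name is just an example; smartd can also schedule
tests for you):

    # Queue a long self-test; the drive runs it in the background
    # and stores the result internally.
    smartctl -t long /dev/sda

    # Later, read back the self-test log and the full attribute dump.
    smartctl -l selftest /dev/sda
    smartctl -a /dev/sda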

There are some programs in /usr/share/mdadm that can check an array, but I would wait until the new drive has been added before testing it; see the sketch after the quoted warning below.  Here is the warning that comes with another script I found:

----------------------------------------

DATA LOSS MAY HAVE OCCURRED.

This condition may have been caused by one or more of the following events:

. A LEGITIMATE write to a memory mapped file or swap partition backed by a
    RAID1 (and only a RAID1) device - see the md(4) man page for details.

. A power failure when the array was being written to.

. Data corruption by a hard disk drive, drive controller, cable etc.

. A kernel bug in the md or storage subsystems etc.

. An array being forcibly created in an inconsistent state using --assume-clean

This count is updated when the md subsystem carries out a 'check' or
'repair' action.  In the case of 'repair' it reflects the number of
mismatched blocks prior to carrying out the repair.

Once you have fixed the error, carry out a 'check' action to reset the count
to zero.

See the md (section 4) manual page, and the following URL for details:

https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ

----------------------------------------
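For what it's worth, once the array is healthy again a check can be
kicked off roughly like this (md0 is just an example name):

    # Debian's helper script, normally run monthly from cron:
    /usr/share/mdadm/checkarray /dev/md0

    # Or trigger a check directly through sysfs:
    echo check > /sys/block/md0/md/sync_action

    # When the check finishes, read the mismatch count it found:
    cat /sys/block/md0/md/mismatch_cnt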

The problem is that if a mismatch count shows up, there is no way to tell which drive in the RAID 1 pair holds the correct data!  I also run programs like debsums after an update, so I know there is no bit rot in important programs, for the reasons explained above.
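By a debsums run I mean something like this (flags as in its man page):

    # Verify checksums of installed package files; -s reports errors only.
    debsums -s

    # Or list just the files whose checksums have changed.
    debsums -c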

Hope this helps.

--



...Bob
