Here, the noticeable lines are IMHO:

  Raw_Read_Error_Rate  (208245592 vs. 117642848)
  Command_Timeout      (8 14 17 vs. 0 0 0)
  UDMA_CRC_Error_Count (11058 vs. 29)

Do these numbers indicate a serious problem with my /dev/sda drive?
And is it a disk problem or a transmission problem?
UDMA_CRC_Error_Count sounds like a cable problem to me, right?

BTW, for a year or so I had problems with /dev/sda every couple of
months, where the kernel set the drive's status in the RAID array
to failed. I could always fix the problem by hot-plugging out the
drive, wiggling the SATA cable, re-inserting and re-adding the
drive (without any impact on the running server). Now I haven't
seen the problem for quite a while. My suspicion is that the cable
is still not working very well, but failures are not frequent
enough to set the drive to "failed" status.

urs
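One note on the Raw_Read_Error_Rate values before anything else: on
Seagate drives the raw number is widely reported (though not
vendor-documented, so treat the encoding as an assumption) to be a
packed value, with the error count in the upper bits and the count
of read operations in the lower 32 bits. Decoded that way, your raw
value contains zero actual errors:

```shell
# Decode a Seagate-style packed Raw_Read_Error_Rate raw value
# (assumed encoding: upper bits = errors, lower 32 bits = operations).
raw=208245592
errors=$((raw >> 32))          # 208245592 < 2^32, so this is 0
reads=$((raw & 0xFFFFFFFF))    # count of read operations
echo "errors=$errors reads=$reads"
```

By that reading, the scary-looking Raw_Read_Error_Rate is normal
for a Seagate; the Command_Timeout and UDMA_CRC_Error_Count values
are the ones more consistent with a link/cable problem.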
I switched from Seagate to WD Red years ago since I couldn't get
the Seagates to last more than a year or so. I have one WD that is
6.87 years old with no errors, well past the 5-year life
expectancy. In recent years WD has stirred up a marketing
controversy over their Red drives. See:
So be careful to get the Pro version if you decide to try WD. I
use the WD4003FFBX (4 TB) drives (RAID 1) and have them at 2.8
years running 24/7 with no problems.
If you value your data, get another drive NOW ... they are already
5 and 5.8 years old! Add it to the array, let it settle in (sync),
and see what happens. I hope your existing array can hold together
long enough to add a third drive. I would have replaced those
drives long ago given all the errors reported. You might also want
to get new cables, since you have had problems in the past.
I also run self-tests weekly to make sure the drives are OK, and
run smartctl -a daily. In addition, I run BackupPC on a separate
server to keep backups of important data.
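For what it's worth, the weekly self-test and daily report can be
driven from cron; a minimal sketch, assuming the device is /dev/sda
(the file name and times are hypothetical -- smartd from
smartmontools can also do this via its -s schedule directive):

```shell
# /etc/cron.d/smart-checks  (hypothetical file name; adjust to taste)
# Daily at 06:00: full SMART report (cron mails the output to root).
0 6 * * *  root  /usr/sbin/smartctl -a /dev/sda
# Sunday at 03:00: start an extended (long) offline self-test.
0 3 * * 0  root  /usr/sbin/smartctl -t long /dev/sda
```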
There are some programs in /usr/share/mdadm that can check an
array, but I would wait until you have added a new drive to the
array before testing it. Here is the warning that comes with
another script I found:
----------------------------------------
DATA LOSS MAY HAVE OCCURRED.
This condition may have been caused by one or more of the
following events:
 . A LEGITIMATE write to a memory-mapped file or swap partition
   backed by a RAID1 (and only a RAID1) device - see the md(4)
   man page for details.
 . A power failure while the array was being written to.
 . Data corruption by a hard disk drive, drive controller, cable,
   etc.
 . A kernel bug in the md or storage subsystems, etc.
 . An array being forcibly created in an inconsistent state using
   --assume-clean.
This count is updated when the md subsystem carries out a 'check'
or 'repair' action. In the case of 'repair' it reflects the number
of mismatched blocks prior to carrying out the repair.
Once you have fixed the error, carry out a 'check' action to reset
the count to zero.
See the md (section 4) manual page, and the following URL for
details:
https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ
----------------------------------------
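As for triggering that 'check' action: on Debian the script in
/usr/share/mdadm is checkarray, but it boils down to the md sysfs
interface. A sketch (run as root; md0 is an assumed array name --
check /proc/mdstat for yours):

```shell
# Start a consistency check on the array (what checkarray does).
echo check > /sys/block/md0/md/sync_action
# Watch progress in /proc/mdstat; once the array is idle again,
# read the mismatch count the check found:
cat /sys/block/md0/md/mismatch_cnt
```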
The problem is that if a mismatch count does occur, there is no
way to tell which drive (RAID 1) holds the correct data! I also
run programs like debsums after an update, so I know there is no
bit rot in important programs.
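debsums itself just compares installed files against the MD5 sums
shipped in /var/lib/dpkg/info/*.md5sums; the idea in miniature (a
toy illustration with a temp file, not debsums' actual code):

```shell
# Record a checksum for a file, then later verify it is unchanged.
echo "important data" > /tmp/bitrot-demo.txt
md5sum /tmp/bitrot-demo.txt > /tmp/bitrot-demo.md5
# After an update (or any time), re-check; -c verifies, --quiet
# prints only failures, and the exit status tells us the result.
result=$(md5sum -c --quiet /tmp/bitrot-demo.md5 >/dev/null 2>&1 && echo OK || echo CHANGED)
echo "$result"
```

In practice "debsums -s" is enough -- it stays silent and reports
only files whose checksums differ.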
Hope this helps.
--