[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

raid1 issue after disk failure: both disks of the array are still active



Hi,
I have a raid1 array with two disks, distro is Squeeze amd64. /dev/sda is slowly dying, here is a snippet of "smartctl -a /dev/sda":

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 1

The bad sector is in the second half-MB of the disk, in fact with "dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1" I get this output in /var/log/syslog:

root@asterisk:~# dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1
0+1 record dentro
0+1 record fuori
430140 byte (430 kB) copiati, 11,7265 s, 36,7 kB/s

Sep 12 22:15:02 asterisk kernel: [ 8921.561978] dd: sending ioctl 80306d02 to a partition! Sep 12 22:15:02 asterisk kernel: [ 8921.561986] dd: sending ioctl 80306d02 to a partition! Sep 12 22:15:03 asterisk kernel: [ 8922.529099] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:03 asterisk kernel: [ 8922.531774] ata3.00: BMDMA stat 0x44
Sep 12 22:15:03 asterisk kernel: [ 8922.533547] ata3.00: failed command: READ DMA Sep 12 22:15:03 asterisk kernel: [ 8922.535313] ata3.00: cmd c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in Sep 12 22:15:03 asterisk kernel: [ 8922.535316] res 51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error) Sep 12 22:15:03 asterisk kernel: [ 8922.538891] ata3.00: status: { DRDY ERR }
Sep 12 22:15:03 asterisk kernel: [ 8922.540675] ata3.00: error: { UNC }
Sep 12 22:15:04 asterisk kernel: [ 8923.508206] ata3.00: configured for UDMA/133
Sep 12 22:15:04 asterisk kernel: [ 8923.508220] ata3: EH complete
Sep 12 22:15:05 asterisk kernel: [ 8924.469512] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:05 asterisk kernel: [ 8924.472323] ata3.00: BMDMA stat 0x44
Sep 12 22:15:05 asterisk kernel: [ 8924.475260] ata3.00: failed command: READ DMA Sep 12 22:15:05 asterisk kernel: [ 8924.477023] ata3.00: cmd c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in Sep 12 22:15:05 asterisk kernel: [ 8924.477025] res 51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error) Sep 12 22:15:05 asterisk kernel: [ 8924.480595] ata3.00: status: { DRDY ERR }
Sep 12 22:15:05 asterisk kernel: [ 8924.482370] ata3.00: error: { UNC }
Sep 12 22:15:06 asterisk kernel: [ 8925.452209] ata3.00: configured for UDMA/133
Sep 12 22:15:06 asterisk kernel: [ 8925.452224] ata3: EH complete
Sep 12 22:15:07 asterisk kernel: [ 8926.418504] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:07 asterisk kernel: [ 8926.420741] ata3.00: BMDMA stat 0x44
Sep 12 22:15:07 asterisk kernel: [ 8926.422486] ata3.00: failed command: READ DMA Sep 12 22:15:07 asterisk kernel: [ 8926.424279] ata3.00: cmd c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in Sep 12 22:15:07 asterisk kernel: [ 8926.424281] res 51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error) Sep 12 22:15:07 asterisk kernel: [ 8926.427861] ata3.00: status: { DRDY ERR }
Sep 12 22:15:07 asterisk kernel: [ 8926.429660] ata3.00: error: { UNC }
Sep 12 22:15:08 asterisk kernel: [ 8927.396270] ata3.00: configured for UDMA/133
Sep 12 22:15:08 asterisk kernel: [ 8927.396285] ata3: EH complete
Sep 12 22:15:09 asterisk kernel: [ 8928.359173] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:09 asterisk kernel: [ 8928.361647] ata3.00: BMDMA stat 0x44
Sep 12 22:15:09 asterisk kernel: [ 8928.364273] ata3.00: failed command: READ DMA Sep 12 22:15:09 asterisk kernel: [ 8928.366028] ata3.00: cmd c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in Sep 12 22:15:09 asterisk kernel: [ 8928.366030] res 51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error) Sep 12 22:15:09 asterisk kernel: [ 8928.369643] ata3.00: status: { DRDY ERR }
Sep 12 22:15:09 asterisk kernel: [ 8928.371420] ata3.00: error: { UNC }
Sep 12 22:15:10 asterisk kernel: [ 8929.340218] ata3.00: configured for UDMA/133
Sep 12 22:15:10 asterisk kernel: [ 8929.340233] ata3: EH complete
Sep 12 22:15:11 asterisk kernel: [ 8930.332648] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:11 asterisk kernel: [ 8930.334453] ata3.00: BMDMA stat 0x44
Sep 12 22:15:11 asterisk kernel: [ 8930.336245] ata3.00: failed command: READ DMA Sep 12 22:15:11 asterisk kernel: [ 8930.337995] ata3.00: cmd c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in Sep 12 22:15:11 asterisk kernel: [ 8930.337998] res 51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error) Sep 12 22:15:11 asterisk kernel: [ 8930.341583] ata3.00: status: { DRDY ERR }
Sep 12 22:15:11 asterisk kernel: [ 8930.343360] ata3.00: error: { UNC }
Sep 12 22:15:12 asterisk kernel: [ 8931.344205] ata3.00: configured for UDMA/133
Sep 12 22:15:12 asterisk kernel: [ 8931.344220] ata3: EH complete
Sep 12 22:15:13 asterisk kernel: [ 8932.306376] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:13 asterisk kernel: [ 8932.308201] ata3.00: BMDMA stat 0x44
Sep 12 22:15:13 asterisk kernel: [ 8932.309948] ata3.00: failed command: READ DMA Sep 12 22:15:13 asterisk kernel: [ 8932.311695] ata3.00: cmd c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in Sep 12 22:15:13 asterisk kernel: [ 8932.311697] res 51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error) Sep 12 22:15:13 asterisk kernel: [ 8932.315262] ata3.00: status: { DRDY ERR }
Sep 12 22:15:13 asterisk kernel: [ 8932.317070] ata3.00: error: { UNC }
Sep 12 22:15:14 asterisk kernel: [ 8933.284204] ata3.00: configured for UDMA/133 Sep 12 22:15:14 asterisk kernel: [ 8933.284234] sd 2:0:0:0: [sda] Unhandled sense code Sep 12 22:15:14 asterisk kernel: [ 8933.284237] sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE Sep 12 22:15:14 asterisk kernel: [ 8933.284241] sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor] Sep 12 22:15:14 asterisk kernel: [ 8933.284246] Descriptor sense data with sense descriptors (in hex): Sep 12 22:15:14 asterisk kernel: [ 8933.284248] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 12 22:15:14 asterisk kernel: [ 8933.284256]         00 00 0f 48
Sep 12 22:15:14 asterisk kernel: [ 8933.284260] sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed Sep 12 22:15:14 asterisk kernel: [ 8933.284267] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 00 00 0f 48 00 00 08 00 Sep 12 22:15:14 asterisk kernel: [ 8933.284274] end_request: I/O error, dev sda, sector 3912 Sep 12 22:15:14 asterisk kernel: [ 8933.286065] Buffer I/O error on device sda1, logical block 233
Sep 12 22:15:14 asterisk kernel: [ 8933.287889] ata3: EH complete


*Why doesn't it fail the first hard disk of the array!!??*

root@asterisk:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      949236 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      311619448 blocks super 1.2 [2/2] [UU]

unused devices: <none>


root@asterisk:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Jun 15 22:45:13 2012
     Raid Level : raid1
     Array Size : 311619448 (297.18 GiB 319.10 GB)
  Used Dev Size : 311619448 (297.18 GiB 319.10 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Sep 12 22:07:58 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : asterisk:0  (local to host asterisk)
           UUID : cea0c4c3:181e2ee3:e4d1f3c0:1008ea62
         Events : 68

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1


As you can see the firmware of the hard disk reports a read error and it still doesn't fail the drive: this is the best way to corrupt data :( As far as I know it should fail the bad drive or at least try to resync it allowing the firmware to reallocate the bad sectors.

I really want to understand how raid1 is expected to work, I simply cannot trust something like this. I'd like to take advantage of the failure to learn something about linux's raid1 behavior.

Thanks,
Niccolò





Some more info about the failed disk:

root@asterisk:~# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-2-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD322HJ
Serial Number:    S17AJDWQ402689
LU WWN Device Id: 5 0000f0 003046298
Firmware Version: 1AC01110
User Capacity:    320,072,933,376 bytes [320 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Sep 12 22:27:56 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
Auto Offline Data Collection: Disabled. Self-test execution status: ( 114) The previous self-test completed having the read element of the test failed.
Total time to complete Offline
data collection:                ( 3888) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  66) minutes.
Conveyance self-test routine
recommended polling time:        (   8) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 099 099 051 Pre-fail Always - 428 3 Spin_Up_Time 0x0007 094 094 011 Pre-fail Always - 2810 4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1077 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail Always - 0 8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 9666 9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 8915 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0 11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1077 13 Read_Soft_Error_Rate 0x000e 099 099 000 Old_age Always - 400 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0033 100 100 099 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 400 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 063 055 000 Old_age Always - 37 (Min/Max 28/45) 194 Temperature_Celsius 0x0022 063 054 000 Old_age Always - 37 (Min/Max 28/46) 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 355155576 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 1 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 0 201 Soft_Read_Error_Rate 0x000a 096 096 000 Old_age Always - 361

SMART Error Log Version: 1
ATA Error Count: 173 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 173 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 0f 00 e0  Error: UNC at LBA = 0x00000f48 = 3912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 48 0f 00 e0 08  18d+09:02:03.824  READ DMA
  ec 00 00 00 00 00 a0 08  18d+09:02:03.814  IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:02:03.814 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  18d+09:02:02.824  IDENTIFY DEVICE

Error 172 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 0f 00 e0  Error: UNC at LBA = 0x00000f48 = 3912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 48 0f 00 e0 08  18d+09:02:01.814  READ DMA
  ec 00 00 00 00 00 a0 08  18d+09:02:01.804  IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:02:01.804 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  18d+09:02:00.854  IDENTIFY DEVICE

Error 171 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 0f 00 e0  Error: UNC at LBA = 0x00000f48 = 3912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 48 0f 00 e0 08  18d+09:01:59.874  READ DMA
  ec 00 00 00 00 00 a0 08  18d+09:01:59.864  IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:01:59.864 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  18d+09:01:58.904  IDENTIFY DEVICE

Error 170 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 0f 00 e0  Error: UNC at LBA = 0x00000f48 = 3912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 48 0f 00 e0 08  18d+09:01:57.924  READ DMA
  ec 00 00 00 00 00 a0 08  18d+09:01:57.924  IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:01:57.924 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  18d+09:01:56.964  IDENTIFY DEVICE

Error 169 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 0f 00 e0  Error: UNC at LBA = 0x00000f48 = 3912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 48 0f 00 e0 08  18d+09:01:55.984  READ DMA
  ec 00 00 00 00 00 a0 08  18d+09:01:55.974  IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:01:55.974 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  18d+09:01:55.014  IDENTIFY DEVICE

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed: read failure 20% 8895 3912 # 2 Short offline Aborted by host 20% 8871 - # 3 Short offline Aborted by host 20% 8847 - # 4 Short offline Aborted by host 20% 8823 - # 5 Extended offline Aborted by host 90% 8800 - # 6 Short offline Aborted by host 20% 8799 - # 7 Short offline Aborted by host 20% 8775 - # 8 Short offline Aborted by host 20% 8751 - # 9 Short offline Aborted by host 20% 8727 - #10 Short offline Aborted by host 20% 8703 - #11 Short offline Aborted by host 20% 8679 - #12 Short offline Aborted by host 20% 8655 - #13 Extended offline Aborted by host 90% 8632 - #14 Short offline Aborted by host 20% 8631 - #15 Short offline Aborted by host 20% 8607 - #16 Short offline Aborted by host 20% 8583 - #17 Short offline Aborted by host 20% 8559 - #18 Short offline Aborted by host 20% 8535 - #19 Short offline Aborted by host 20% 8511 - #20 Short offline Aborted by host 20% 8487 - #21 Extended offline Aborted by host 90% 8464 -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--
http://www.linuxsystems.it


Reply to: