
Bug#446323: mdadm: recovery in infinite loop



On Thu, Feb 5, 2009 at 11:22 PM, Lukasz Szybalski <szybalski@gmail.com> wrote:
> On Thu, Feb 5, 2009 at 12:45 PM, martin f krafft <madduck@debian.org> wrote:
>> also sprach Lukasz Szybalski <szybalski@gmail.com> [2009.02.05.0731 +0100]:
>>> Both drives are fairly new...Would the drive be defective since day1?
>>>
>>> Was the error log that I am getting now available in previous version
>>> of the kernel?
>>
>> Yes.
>>
>> Either the drives are faulty or the controller.
>
> This is software RAID. Which controller do you mean?
>
> Is there a way to run fsck so that it moves the bad sectors to good ones?
>
> I've also run smartctl on the hard drive that is being added, the one
> that gives errors, and it shows no logged errors.
>
> aptitude install smartmontools
>
>
> "=== START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> No Errors Logged
> "
>
>
> And the drive that is active, which I want to sync from:
> "=== START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> ATA Error Count: 1665 (device log contains only the most recent five errors)
>        CR = Command Register [HEX]
>        FR = Features Register [HEX]
>        SC = Sector Count Register [HEX]
>        SN = Sector Number Register [HEX]
>        CL = Cylinder Low Register [HEX]
>        CH = Cylinder High Register [HEX]
>        DH = Device/Head Register [HEX]
>        DC = Device Command Register [HEX]
>        ER = Error register [HEX]
>        ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 1665 occurred at disk power-on lifetime: 11609 hours (483 days + 17 hours)
>  When the command that caused the error occurred, the device was
> active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 07 5b 00 48 e0  Error: UNC 7 sectors at LBA = 0x0048005b = 4718683
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 08 5a 00 48 05 28      13:59:36.510  READ DMA EXT
>  25 00 08 52 00 48 05 28      13:59:36.510  READ DMA EXT
>  25 00 08 4a 00 48 05 28      13:59:36.510  READ DMA EXT
>  25 00 08 42 00 48 05 28      13:59:36.510  READ DMA EXT
>  25 00 08 3a 00 48 05 28      13:59:36.510  READ DMA EXT
>
> Error 1664 occurred at disk power-on lifetime: 11609 hours (483 days + 17 hours)
>  When the command that caused the error occurred, the device was
> active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 37 5b 00 48 e0  Error: UNC 55 sectors at LBA = 0x0048005b = 4718683
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 38 5a 00 48 05 28      13:59:34.340  READ DMA EXT
>  25 00 40 52 00 48 05 28      13:59:32.250  READ DMA EXT
>  25 00 48 4a 00 48 05 28      13:59:30.185  READ DMA EXT
>  25 00 50 42 00 48 05 28      13:59:28.060  READ DMA EXT
>  25 00 58 3a 00 48 05 28      13:59:25.970  READ DMA EXT
>
> Error 1663 occurred at disk power-on lifetime: 11609 hours (483 days + 17 hours)
>  When the command that caused the error occurred, the device was
> active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 37 5b 00 48 e0  Error: UNC 55 sectors at LBA = 0x0048005b = 4718683
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 40 52 00 48 05 28      13:59:32.250  READ DMA EXT
>  25 00 48 4a 00 48 05 28      13:59:30.185  READ DMA EXT
>  25 00 50 42 00 48 05 28      13:59:28.060  READ DMA EXT
>  25 00 58 3a 00 48 05 28      13:59:25.970  READ DMA EXT
>  25 00 60 32 00 48 05 28      13:59:23.735  READ DMA EXT
>
> Error 1662 occurred at disk power-on lifetime: 11609 hours (483 days + 17 hours)
>  When the command that caused the error occurred, the device was
> active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 37 5b 00 48 e0  Error: UNC 55 sectors at LBA = 0x0048005b = 4718683
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 48 4a 00 48 05 28      13:59:30.185  READ DMA EXT
>  25 00 50 42 00 48 05 28      13:59:28.060  READ DMA EXT
>  25 00 58 3a 00 48 05 28      13:59:25.970  READ DMA EXT
>  25 00 60 32 00 48 05 28      13:59:23.735  READ DMA EXT
>  25 00 68 2a 00 48 05 28      13:59:21.665  READ DMA EXT
>
> Error 1661 occurred at disk power-on lifetime: 11609 hours (483 days + 17 hours)
>  When the command that caused the error occurred, the device was
> active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 36 5c 00 48 e0  Error: UNC 54 sectors at LBA = 0x0048005c = 4718684
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 50 42 00 48 05 28      13:59:28.060  READ DMA EXT
>  25 00 58 3a 00 48 05 28      13:59:25.970  READ DMA EXT
>  25 00 60 32 00 48 05 28      13:59:23.735  READ DMA EXT
>  25 00 68 2a 00 48 05 28      13:59:21.665  READ DMA EXT
>  25 00 70 22 00 48 05 28      13:59:19.260  READ DMA EXT
>
> lucas@hplinux:~$ cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4] [raid0]
> md4 : active raid1 hda5[0] hdc5[1]
>      1951744 blocks [2/2] [UU]
>
> md2 : active raid1 hdc2[1]
>      276438400 blocks [2/1] [_U]
>
> md0 : active raid1 hda1[0] hdc1[1]
>      34178176 blocks [2/2] [UU]
> "
>
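
The [_U] on md2 above is the part that marks the array as degraded. A
minimal sketch for spotting that automatically, assuming the mdstat layout
shown above (the function name degraded_arrays is made up for this example):

```shell
# Print the names of md arrays whose member-status field (e.g. [_U])
# contains an underscore, meaning at least one slot is missing.
# Reads mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the current array name
        /\[[U_]+\]/ && name != "" {        # the "blocks ... [UU]" status line
            if ($NF ~ /_/) print name      # underscore = missing member
            name = ""
        }
    '
}

# Typical use on a live system:
#   degraded_arrays < /proc/mdstat
```

Fed the output quoted above, this should print only md2.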



Since it seems that hdc has some issues, I know I can copy the data
manually from hdc2 to hda2:

cp -dpRx /files /mnt/hda2

How do I remove hdc2 from md2 (md2 will then be empty), add hda2 as the
primary (first) drive of md2, and then re-add hdc2 so that it syncs
from hda2?
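
One possible sequence, as an untested dry-run sketch rather than a
confirmed procedure: build a new degraded mirror on hda2, copy the data
onto it, retire md2, then add hdc2 to the new array. The script only
prints the commands. The name /dev/md5, the ext3 filesystem and the
/mnt/new mount point are assumptions made for this example, not taken
from this system, and plan_rebuild is a made-up helper name.

```shell
# Dry run: prints the mdadm steps instead of executing them.
# Arguments: new array, good partition, suspect partition, old array.
plan_rebuild() {
    new_md=$1; good=$2; bad=$3; old_md=$4
    cat <<EOF
mdadm --create $new_md --level=1 --raid-devices=2 $good missing
mkfs.ext3 $new_md
mount $new_md /mnt/new
cp -dpRx /files /mnt/new
# unmount the filesystem on $old_md before the next step
mdadm --stop $old_md
mdadm --zero-superblock $bad
mdadm $new_md --add $bad
EOF
}

plan_rebuild /dev/md5 /dev/hda2 /dev/hdc2 /dev/md2
```

The last printed command would start a resync from hda2 onto hdc2. If md2
holds the root filesystem, these steps would have to be run from a rescue
environment, and /etc/mdadm/mdadm.conf and /etc/fstab would need updating
to point at the new array.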

At this point I think we can close this bug, as it appears to be a
hard drive issue, unless somebody has a similar issue or knows what
else might be causing it. I guess I should have run smartctl when I
first got the hard drive to see whether it reported any errors.
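
For a quick check on a new drive, a small sketch that pulls the error
count out of the smartctl error log, assuming the log format quoted above
(ata_error_count is a made-up helper name):

```shell
# Extract the number after "ATA Error Count:" from smartctl error-log
# text on stdin. Prints 0 when no count line is present, as in the
# "No Errors Logged" case.
ata_error_count() {
    awk '/^ATA Error Count:/ { n = $4 } END { print n + 0 }'
}

# Typical use (needs root and the smartmontools package):
#   smartctl -l error /dev/hdc | ata_error_count
```

On the log quoted above for the active drive this should print 1665.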

Thanks,
Lucas


