
HELP! Re: How to fix I/O errors?



It did not go well.  I tested the new drive with SeaTools and it was fine.  Then I made a Clonezilla live CD and booted from it.  It stopped on the first read error with a message saying to restart using the rescue option.  I did that.  After 5 hours it finished without mentioning any errors.

I tried to boot from the old disk (since it was still wired that way).  I got dropped into a maintenance shell with fs errors on /dev/sda4, which is the physical volume for all my LVM logical volumes -- /usr, /var, /home and /tmp.  It said to run fsck manually.

I decided to try the new drive, so I changed the cables and re-booted.

Maintenance shell, again.

/ mounted clean 

lvm started

/home fs has errors, run fsck (at this point, I'm afraid to try it)

/var, /usr, and /tmp all say that the superblock can not be read, or is invalid.  Try running 

e2fsck -b 8193 <device>
or
e2fsck -b 32768 <device>

Which do I use?
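
For reference, the e2fsck man page ties those two numbers to the filesystem block size: 8193 is where the first backup superblock sits on a filesystem with 1k blocks, and 32768 is the location for 4k blocks (16384 for 2k).  A small sketch of that arithmetic, assuming the filesystems were made with default mke2fs group sizes; the function name is made up for illustration:

```shell
#!/bin/sh
# first_backup_superblock BLOCK_SIZE
# Prints the block number of the first backup superblock of an ext2/3/4
# filesystem, ASSUMING default mke2fs settings (hypothetical helper).
first_backup_superblock() {
    bs=$1
    case "$bs" in
        1024|2048|4096) ;;
        *) echo "unsupported block size: $bs" >&2; return 1 ;;
    esac
    # One block bitmap covers 8 * block_size blocks, which is the default group size.
    blocks_per_group=$((bs * 8))
    # With 1k blocks the primary superblock occupies block 1, so every group
    # starts one block later than it does with larger block sizes.
    if [ "$bs" -eq 1024 ]; then
        first_data_block=1
    else
        first_data_block=0
    fi
    echo $((blocks_per_group + first_data_block))
}

first_backup_superblock 1024
first_backup_superblock 4096
```

Volumes larger than a few hundred MB are normally created with 4k blocks, so 32768 is the likelier candidate here, but that is an assumption about how these filesystems were made.  Adding -n to the e2fsck command opens the filesystem read-only and answers "no" to every question, which allows a look without changing anything.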

How did trying to clone the disk make such a mess of BOTH disks?

Any help getting a working system again will be greatly appreciated.

Marc

On Feb 6, 2017 2:37 PM, "David Christensen" <dpchrist@holgerdanske.com> wrote:
On 02/06/17 13:15, Marc Shapiro wrote:
I am pasting the result of smartctl -x /dev/sda below as I have no real
clue what to do with the information, but I have a few questions first.

1) I have purchased a new, very similar, Seagate 1TB drive and I plan to
install it and copy the whole system to the new drive.

It sounds like you don't have a backup of the failing 1 TB drive (?).


Do you have a file server with ~1 TB of free space?  RAID?


Run memtest86+ for 24+ hours to verify that you don't have a memory problem.


Use SeaTools to wipe the new 1 TB drive and run the short and long tests.  Stop if anything fails.



What is the best
way to do this copy since I don't want to copy bad sectors?

I've done it with 'dd' in the past, but will use 'ddrescue' in the future.
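
A minimal sketch of what a two-pass GNU ddrescue run might look like, assuming the failing drive is /dev/sda, the new drive is /dev/sdb, and the map file lives on a third filesystem.  All three names are placeholders, so check them with lsblk before running anything.  The helper only prints the commands, it does not run them:

```shell
#!/bin/sh
# ddrescue_plan SRC DST MAPFILE
# Prints (does not run) a two-pass GNU ddrescue plan.
# SRC, DST and MAPFILE are placeholders supplied by the caller.
ddrescue_plan() {
    src=$1 dst=$2 map=$3
    if [ "$src" = "$dst" ]; then
        echo "source and destination must differ" >&2
        return 1
    fi
    # Pass 1: grab everything that reads easily and skip the slow
    # scraping phase (-n); -f is needed to overwrite a device.
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: return to the bad areas using direct disc access (-d)
    # and up to three retry passes (-r3); the map file lets it resume.
    echo "ddrescue -f -d -r3 $src $dst $map"
}

ddrescue_plan /dev/sda /dev/sdb /root/rescue.map
```

Splitting the work this way copies the healthy areas first, so the failing drive is stressed by retries only after most of the data is already safe.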



2) Once I have verified that the new drive boots

I'd do a fresh install on a 16+ GB SSD (USB flash drives also work).  A recovered system disk image is too uncertain.



and everything is running properly

As I understand it, the drive microcontroller calculates and stores a checksum with every sector (block).  That's one way it knows that a block is bad upon reading.  So, when you copy out whatever blocks you can get, you probably won't have errors in those blocks.


But, files and directories are stored on one or more sectors.  Depending upon your file system, fsck may or may not find the missing blocks.


When you're done, the destination disk is likely to be missing files and/or directories.



I am hoping to reformat the old drive.  This should
reallocate the bad sectors IIRC.  I then would like to set up a raid
with both drives (keeping a close eye on the old drive).  The
feasibility of this, I would guess, depends on what the posted smartctl
information tells someone who knows what to look for.

3) As I understand it, the above mentioned raid should be safe since,
even if the old drive deteriorates further, the system can run on just
the new drive.  Is that correct?

Once you've copied out whatever blocks you can get, use SeaTools to wipe the old 1 TB drive and run short and long tests.  If all three pass, I might be tempted to re-use the drive.


If it fails to wipe and has plaintext, destroy it with a sledge hammer. (Wear safety glasses!)


If it wipes but fails the short or long tests, recycle it.
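
The three outcomes above can be written down as a tiny decision helper.  This is only a sketch: the function name and the "pass"/"fail" arguments are invented for illustration, and it treats any drive that cannot be wiped as still holding plaintext:

```shell
#!/bin/sh
# old_drive_verdict WIPE SHORT LONG
# Each argument is "pass" or "fail" (hypothetical convention).
old_drive_verdict() {
    wipe=$1 short=$2 long=$3
    if [ "$wipe" != pass ]; then
        # Could not be wiped, so assume readable data is still on it.
        echo "destroy"
    elif [ "$short" = pass ] && [ "$long" = pass ]; then
        # Wipe, short test and long test all passed.
        echo "maybe reuse"
    else
        # Wiped cleanly but failed a self-test.
        echo "recycle"
    fi
}

old_drive_verdict pass pass fail
```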



Here is the smartctl output:
...

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Interesting, given that the drive failed SeaTools (short test?).



General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.

Matches SeaTools result.



Total time to complete Offline
data collection:         (  600) seconds.
...

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   117   095   006    - 165391146
  3 Spin_Up_Time            PO----   095   093   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    406
  5 Reallocated_Sector_Ct   PO--CK   072   072   036    -    1181
  7 Seek_Error_Rate         POSR--   087   060   030    - 656506200
  9 Power_On_Hours          -O--CK   048   048   000    -    46195
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    203
183 Runtime_Bad_Block       -O--CK   092   092   000    -    8
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   011   011   000    -    89
188 Command_Timeout         -O--CK   100   097   000    - 51540394008
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   049   045    -    30 (Min/Max 27/32)
194 Temperature_Celsius     -O---K   030   051   000    -    30 (0 20 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   034   003   000    - 165391146
197 Current_Pending_Sector  -O--C-   093   083   000    -    310
198 Offline_Uncorrectable   ----C-   093   083   000    -    310
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    26
240 Head_Flying_Hours       ------   100   253   000    -    46718 (49 76 0)
241 Total_LBAs_Written      ------   100   253   000    - 1725386978
242 Total_LBAs_Read         ------   100   253   000    - 265479204
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

I have yet to find a good explanation for reading smartctl reports. This post gives some clues:

https://ubuntuforums.org/showthread.php?t=2192335


Here is my ST3000DM001 for comparison:


SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   115   099   006    -    90256224
  3 Spin_Up_Time            PO----   094   094   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    577
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   063   060   030    -    1955231
  9 Power_On_Hours          -O--CK   096   096   000    -    3552
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    576
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   059   045    -    30 (Min/Max 19/30)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    35
193 Load_Cycle_Count        -O--CK   100   100   000    -    1323
194 Temperature_Celsius     -O---K   030   041   000    -    30 (0 17 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    269092585999820
241 Total_LBAs_Written      ------   100   253   000    -    2338230420
242 Total_LBAs_Read         ------   100   253   000    -    19882466886


These statistics for your drive look suspicious:

Reallocated_Sector_Ct
Reported_Uncorrect
Runtime_Bad_Block
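
One way to pull those three counters out of a report, sketched under the assumption that the attribute table has the same eight-column layout as the output pasted in this message (RAW_VALUE in column 8).  The function name is made up, and other smartctl output formats put the raw value in a different column:

```shell
#!/bin/sh
# suspicious_attrs: reads smartctl attribute lines on stdin and prints
# "NAME RAW_VALUE" for the three watched counters when the raw value
# is greater than zero.  ASSUMES the 8-column layout
# (ID# NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE).
suspicious_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Reported_Uncorrect"    ||
        $2 == "Runtime_Bad_Block" {
            # $8 + 0 forces a numeric comparison of the raw value.
            if ($8 + 0 > 0) print $2, $8
        }
    '
}

# Example use (needs root and smartmontools; /dev/sda is a placeholder):
#   smartctl -x /dev/sda | suspicious_attrs
```

A healthy drive should print nothing; any line of output names a counter worth a closer look.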


...

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 89 (device log contains only the most recent 20 errors)

That's not good.  Mine says:

No Errors Logged



SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90% 46194

This could be SeaTools (?).


Let us know how it turns out.


David


