Re: need help with approx-gc

Paul E Condon wrote:
> The following is just a few examples from kern.log:
> May  8 11:32:49 cmn kernel: [4880283.861051] end_request: I/O error, dev sda, sector 16136192

Ouch!  You have a disk that is crying out for help.  Oh the pain and
suffering of it!

> All of them have the same sector number. This is the sda drive,
> which is formatted as ext4. Is there some way that the automatic
> reallocate could be repaired by a forced manual fsck? and is the
> rescue function on the netinst CD adequate for this?

I have often been in the same situation.  I would ensure that the
backup is current and valid and then replace the disk.  That is me.  I
have seen disks get worse very quickly after they have exhibited
failures.  Modern drives keep a pool of internal spare sectors and
remap failing ones on their own.  By the time the disk is showing
errors to the operating system the internal spares have probably all
been consumed by earlier failures.
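
If you want to watch whether the damage is spreading beyond that one
sector, a small helper like this tallies the logged errors by device
and sector.  It is only a sketch.  The name count_io_errors is made
up, and it assumes the log lines look like the end_request line you
quoted.

```shell
# Tally I/O errors per device and sector from a kernel log.
# Assumes the "end_request: I/O error, dev X, sector N" line format.
count_io_errors() {
    # $1: log file to scan (defaults to /var/log/kern.log)
    sed -n 's|.*I/O error, dev \([^,]*\), sector \([0-9]*\).*|\1 \2|p' \
        "${1:-/var/log/kern.log}" | sort | uniq -c | sort -rn
}
```

  count_io_errors /var/log/kern.log

Each output line is a count followed by the device and sector.  New
sectors showing up over time would mean the disk is getting worse.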

Problems like this will quickly make you a believer in RAID.  I pretty
much RAID everything these days just to avoid being in this
situation.  In a RAID the bad disk would already have been kicked out
of the array.  The array would then keep running in degraded mode on
the remaining drives and the system would stay up without problems.
Replacing the failing drive and resyncing the array can all happen
while the system is up and online.
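
As a rough illustration, assuming Linux software RAID (md) managed
with mdadm, which is my assumption and only one of several ways to do
RAID, a degraded array shows an underscore in the status brackets of
/proc/mdstat.  The function name and the device names in the comments
are placeholders.

```shell
# List md arrays that are running degraded.
# A healthy two-disk mirror shows [UU]; a missing member shows as "_".
degraded_md_arrays() {
    # $1: mdstat file to read (defaults to /proc/mdstat)
    awk '/^md[^ ]* :/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}

# Typical online replacement, with placeholder names (run as root):
#   mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
#   ...swap the disk and partition the new one to match...
#   mdadm /dev/md0 --add /dev/sda1
```

  degraded_md_arrays

No output means every array has all of its members.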

> Not running SMART.
> What Debian package provides smartctl ?

  apt-get install smartmontools
  smartctl -l error /dev/sda

I expect that to show errors.

  smartctl -t short /dev/sda
  sleep 120
  smartctl -l selftest /dev/sda

I expect that to show errors.
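
Since you asked about reallocation, the attribute table from
"smartctl -A" is also worth a look.  Here is a small filter as a
sketch.  The function name is made up, and the attribute names are the
ones smartctl commonly reports for ATA disks.  Your drive may name them
differently or not report them at all.

```shell
# Print the raw values of the sector-health attributes from
# "smartctl -A" output read on stdin.
# Column 2 is the attribute name and column 10 is the raw value.
smart_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2, $10 }'
}
```

  smartctl -A /dev/sda | smart_sector_counts

Nonzero raw values there would fit with the I/O errors in your log.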

> I don't think the following tests will make the reallocation problem
> go away.

Nope.  Seems like a disk failure to me.

> I was planning to do something else this weekend, Oh well.

RAID.  I can't say enough good things about it in these situations.
And backup.

BTW...  I have a low priority machine that is complaining right now
that its SMART self-tests are failing.  It hasn't reached the actual
I/O error stage yet but it is only a matter of time.  Because it is
low priority I haven't done anything about it yet, and it is still up
and running.  I have a replacement disk and will swap out the failing
one as soon as I get a few spare minutes this weekend.  But tomorrow
looks pretty busy for me so I probably won't get to it until Monday.
I have no stress about it because it is a RAID and the other disk is
healthy.  Plus backups are current.


