
Re: fsck on boot...revisited



Roger Leigh wrote:
> green wrote:
> > Tim Nelson wrote:
> > > On occasion, we find that a filesystem error is bad enough that
> > > instead of auto{matically|magically} fixing the issue and continuing
> > > to boot, the system hangs, needing a root password entered for a
> > > manual fsck to be run.
> > > 
> > > My question is thus: How do I prevent that requirement to login and
> > > run fsck manually? Is there some parameter that can be set? Or, am I
> > > going about this the completely wrong way?
> > 
> > You mentioned the FSCKFIX option; according to rcS(5) man page,
> > setting it to "yes" in /etc/default/rcS will do what you want.  This
> > causes fsck to be run with -y instead of -p which is somewhat
> > dangerous but hopefully will in your case successfully repair the
> > filesystem.

I always set FSCKFIX=yes in /etc/default/rcS and think that is the
best default.
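
For anyone who wants to set this, a minimal sketch of the setting
(assuming a sysvinit system that reads /etc/default/rcS as described
in the rcS(5) man page quoted above):

    # /etc/default/rcS
    # Run fsck with -y (answer yes to all repair questions) at boot
    # instead of -p (preen, which only makes safe automatic fixes).
    FSCKFIX=yes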

> From a usability point of view, there have been many requests
> over the years to make FSCKFIX=yes the default.  However, from
> a safety point of view, this is not fine due to the risk of
> unrecoverable data corruption if it does the wrong thing.

The problem is that a vanishingly small number of people know how to
drive a filesystem debugger and repair a broken filesystem better than
the automated tools.  Many more people operate headless computers as
servers.  If you, the reader of this message, are one of the elite
souls who can manually fix a corrupted filesystem then that is
awesome.  But the rest of us do not have the needed knowledge and
skills to do so.

For the large majority of users the current default setting of
FSCKFIX=no is a problem because it will result in a system that won't
boot without a human on the console to manually answer yes to the fsck
questions.  On a desktop you are there and just do it.  On a server
you need to get on the console.  In the simple case, when the machine
is in a data center, that typically requires a support request.  But
for many of us it would mean a long drive to the city hosting the
system in order to physically touch the hardware, attach a console,
and answer yes.  For most of us the FSCKFIX=yes setting is a much
better default.

> We would prefer the admin to take responsibility for any needed
> actions prior to fsck (imaging the disc, backups, etc.)

I must object to this.  I do not personally know anyone in the real
world that I could eat lunch with who has the skills to manually
repair a corrupted filesystem.  I am confident that if it were
possible to conduct a fair poll of the readers of debian-user, an
extremely small percentage would have the skill to do so.
(I am sure that some do.  You reading this might be one of those few.)

And yet we are all using Debian systems.  The vast majority of us
must count on the automated fsck to repair the filesystem.  Most of
us, if the filesystem needs an fsck, would simply answer yes and
proceed with it.  And if the fsck is unable to do the repair then we
would fall back to restoring from backup.  I know that I would create
a new filesystem and restore from backup in that case.  A backup is
still always needed for safety; RAID does not remove the need for it.

And so I object to the idea that allowing the admin to choose whether
or not to fsck gives the admin any real choice.  It isn't much of a
choice.  I don't think it is any real choice at all.  It feels simply
like a way to offload and deflect blame.  It is now always possible to
point fingers at the local admin.  "If you lost data then it is your
fault because you pushed the button."  And yet for most people that is
the only thing they can do.  I myself would push the button.

About the only other option would be to make a bit copy of each of
the disks in the system.  Save those copies off.  Then
if something goes wrong they can send those full bit copies to someone
else who has the skills to possibly recover the data.  That would
certainly always be a safe recipe whenever a system crashes and needs
an fsck.

So say you have a simple system with two 1 TB disks in a RAID1
mirror.  You would only need two more 1 TB disks to hold a full bit
copy of both disks.  You would only need an additional system in
which to mount those disks, to use as a host for making the copy.
You only need a few hours of your time to physically pull the disks
and do the copy, and to be careful not to make an additional mistake
that turns a simple problem into a complicated one.  With a full copy
you could repeat the restoration several times using different
techniques and definitely increase your odds of success.  If you ever
need to fsck the system then the safest recipe would be to always do
this.
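
As a rough sketch of what that copy could look like, assuming a
rescue environment where nothing is mounted and using purely
hypothetical device names (/dev/sda and /dev/sdb as the two RAID1
members, a spare disk mounted at /mnt/spare):

    # Copy each RAID member bit for bit to an image file on the
    # spare disk; continue past read errors rather than aborting.
    dd if=/dev/sda of=/mnt/spare/sda.img bs=1M conv=noerror,sync
    dd if=/dev/sdb of=/mnt/spare/sdb.img bs=1M conv=noerror,sync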

But how many people reading debian-user have these resources of extra
disks and systems?  I could certainly do this for myself but even I
consider that too much trouble.  I simply run the fsck and expect
that 99 times out of 100 (numbers I just made up) it will succeed.
In the unusual case where it is not able to repair the filesystem
automatically, I would restore from backup.  Restoring from backup in
that unusual case is easier than always doing the safest thing of
making a bit copy of the disks.  And it is safer, because in 99
rounds of pulling, mounting, and copying the disks I would almost
certainly make a human error and break something else along the way.

> and in some setups e.g. software RAID, it's possible we might fsck
> parts of an unreconstructed RAID set and totally destroy it.

Could you say a few more words about how this might occur?  Because I
cannot think of a way for this to happen.  Certainly I could force it
to happen.  But forcing it to happen isn't the same thing as it
happening accidentally with a normal system configuration and an
accident such as a system crash caused by a power loss.

My not being able to think of a way for this to happen doesn't mean
it can't.  I learn something new as often as possible, and I would
like to be educated on how it might happen.  It does not seem
possible given the way Debian is structured, and if it is possible
then I would like to understand the failure case.  Please say how
FSCKFIX=yes might cause a catastrophic loss.
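
For what it is worth, on my own systems I would expect any manual
fsck on software RAID to be run against the assembled md device and
not against a raw member disk.  Checked by hand that would look
something like this, with /dev/md0 as a hypothetical array name:

    # Confirm the array is assembled and not degraded before fsck.
    cat /proc/mdstat
    mdadm --detail /dev/md0
    # fsck the md device itself, never a member such as /dev/sda1.
    fsck -y /dev/md0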

> There are quite a few other pros and cons, but that's essentially
> the reason for it being opt-in; you take the responsibility for the
> small chance it might do rather bad things.

Every time we power on a system we accept the risk that something bad
might happen.  If we don't like it then we can choose not to turn the
power on.  Of course such a choice isn't a useful choice.

Bob
