
Re: fsck'd



On Fri, Mar 07, 2008 at 08:29:08AM -0500, John Fleming wrote:
> Background - I had a well-established LAMP server that was giving some
> filesystem errors on boot, with the "hit control-D to continue or give root
> password to fix manually" message. It would go ahead and work normally if I
> hit Control-D.  

You should have fixed it manually.  Control-D is in case the person
sitting there doesn't have the root password and is generic to Debian's
single-user mode.

> However, I wanted to try to get rid of the error and need for human
> intervention in the event of the need for a remote reboot, so I tried
> to fix the errors with fsck.  Somehow I ended up with a badly trashed
> filesystem and inability to reboot.
> 
> After much gnashing of teeth and consideration of my options, I
> installed a new etch system.  I installed several benign packages like
> apache.  I ran update and dist-upgrade to bring the system up to date.
> When I ran the upgrade, it told me that it was trying to install an
> identical kernel image. 

Well, it was a new kernel image with the same version string, so the
new modules go into the same directory as the old modules.  The new
modules won't work with the old kernel that is still running, so if you
do anything to trigger a module load, bad things can happen.  That is
why you reboot as soon as the upgrade is complete.

> It explained some things about what it was doing about modules, and
> then said to be sure to reboot.  I did that, but then it gave me that
> now-familiar message about how the filesystem has errors, hit
> control-D to continue...

> I booted with Knoppix, made sure my filesystem on /dev/hda1 was NOT
> mounted, and ran fsck -f.  It did the 5 passes without mention of
> errors.  I ran it a second time with same results.  However, when I
> boot from /dev/hda1, I still get the error about a filesystem with
> errors!
> 
> Trying to rebuild the server as it was is painful enough - Why would I
> be having these filesystem errors?  The HDD is relatively new.
> 
> Any other way to try to get rid of the boot error before I reinstall
> etch again?  I hate to do that because I don't understand how these
> errors originate, so I don't know why I shouldn't expect them to crop
> up again at some point later after another fresh install.
> 
> Why does the fsck during boot find errors when the fsck run via
> knoppix on the same filesystem return clean?

Don't know why.  Here's how I'd proceed:

1.	boot with init=/bin/sh on the kernel command line, since Debian's
single-user mode gives you most filesystems already mounted.

2.	run fsck (read the man page to give you the options appropriate
to your root fs);  run it on all your filesystems.

3.	shutdown -h and power-cycle.

4.	run aptitude update then upgrade anything required.

5.	reboot.  Watch the screen for any errors on shutdown that would
	suggest that the system isn't, e.g., remounting the / filesystem
	read-only before halt/reboot.  If in doubt, set up a serial
	console and log the output, or send the console output to a
	printer.

6.	If you still have problems, boot knoppix (I use grml) and run
	fsck.  If this is ext2/3, I'd run -c -c so that the entire disk
	gets read to force the drive firmware to re-map any bad sectors.
	While this is running, I'd be watching /var/log/syslog for any
	errors from the drive.

7.	Ensure that you have smartmontools installed, run a long
	SMART self-test and, when it's complete, check the results
	for the drive.
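
For steps 6 and 7, something like the following is a rough sketch, not
a recipe.  It assumes the root filesystem is ext2/3 on /dev/hda1 (taken
from the original post), that the whole disk is /dev/hda, and that you
are root on the live CD with nothing on that disk mounted.  The run
helper and the RUN variable are made up for illustration: by default
the script only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch of steps 6 and 7, meant for a live CD with the fs unmounted.
# DEV and DISK are assumptions; adjust them to your own layout.
DEV="${DEV:-/dev/hda1}"
DISK="${DISK:-/dev/hda}"

# Hypothetical helper: print each command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

check_disk() {
    # -f forces a check even if the filesystem is marked clean;
    # -c given twice asks for a non-destructive read-write badblocks scan.
    run e2fsck -f -c -c "$DEV"
    # Start the long self-test; it runs inside the drive in the background.
    run smartctl -t long "$DISK"
    # After the test has finished, read back the self-test log.
    run smartctl -l selftest "$DISK"
}

check_disk
```

smartctl prints an estimated completion time when the long test
starts; the self-test log is only worth reading after that.  In a
second terminal, tail -f /var/log/syslog shows any errors the kernel
logs for the drive while e2fsck is reading it.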

---

If all else fails, plan for a reinstall (ensure that you have backups).
Then boot knoppix and run wipe on the drive.  This fully exercises the
drive to exorcise any gremlins.  Then install etch minimal (don't select
any tasks), ensure that aptitude is installed (if it isn't, then apt-get
install aptitude), get aptitude set up the way you like with only
necessary packages marked as manual, the rest as automatic, then do an
update and upgrade before you install any other packages.

At each stage, do a shutdown -rF.
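
As a rough sketch of the package side of that sequence on a freshly
installed minimal system (the wipe and the installer run themselves are
left out), assuming you are root.  The run helper and the RUN variable
are made up for illustration; by default nothing is executed and the
commands are only printed.

```shell
#!/bin/sh
# Sketch of the post-install steps on a fresh minimal etch system.
# Hypothetical helper: print each command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

fresh_install_steps() {
    # Only needed if the minimal install did not pull in aptitude.
    run apt-get install aptitude
    # Refresh the package lists, then bring installed packages up to date.
    run aptitude update
    run aptitude upgrade
    # -r reboots, -F forces an fsck on the way back up;
    # "now" is the time argument that shutdown expects.
    run shutdown -rF now
}

fresh_install_steps
```

Repeating the last command after each later stage (for example, after
each batch of packages) shows which step first brings the boot-time
fsck complaint back.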

Doug.


