
Syncing during installation can prevent massive filesystem corruption



I have found that possibly the worst time to have a system lockup is during a software installation or upgrade. I've found this out the hard way and wish to suggest a way to make it a little safer - sync the disks frequently during the installation.

If you select a bunch of packages with tasksel, or you do an update after not having updated in a long time, you could be installing a couple hundred packages that contain thousands, possibly tens of thousands of files. In my old installation yesterday all my packages together consisted of 42,000 files, and I didn't have a particularly large number of packages installed.

Think about what happens in the kernel when you write to a filesystem. Disk writes go into the buffer cache and aren't committed to disk right away. Suppose you delete a file and write a new one. The kernel creates a buffer cache entry that reflects the state the disk should be in with the old file deleted, and then copies the new file data into some cache blocks and creates an in-memory inode that stores where those blocks will eventually be in the filesystem.

Eventually the cache fills up and the least-recently used blocks finally get committed to disk. For a non-journaled filesystem like ext2, the order in which the data gets committed won't bear any particular relation to the final data structure desired. While this is happening, the filesystem that's actually on the drive is temporarily in a corrupted state, and lots of your file data is missing - the filesystem can only be considered correct if you take the buffer cache into account, and that will be lost in the event of a crash or power failure.

If you're writing a lot of files, then until all of the buffer cache has been committed, the byte pattern that's on the hard drive is in a highly corrupted state.
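
As a rough illustration of how much written data can be sitting in memory only, here is a minimal sketch that reads the "Dirty" figure from /proc/meminfo. It assumes a Linux kernel whose /proc/meminfo has a Dirty line (the 2.4 kernels discussed here may not), and the function name dirty_kb is made up for this example:

```shell
#!/bin/sh
# Sketch only: dirty_kb is a hypothetical helper, not part of any package.
# Assumes a meminfo-style file with a line such as "Dirty:   128 kB".

dirty_kb() {
    # Print the number of kB of data waiting to be written to disk.
    # The file argument is optional and defaults to /proc/meminfo.
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}
```

Running dirty_kb, then sync, then dirty_kb again during a large install should show the figure dropping after the sync, if the kernel reports it at all.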

Now my sad experience. I was running a kernel before (2.4.14 for PowerPC) that I guess must have been buggy because I would get sudden lockups from time to time. It seemed to come when there was a lot of filesystem activity, such as when doing an update. Progress of the update would halt, X11 would become unresponsive, and I'd have to power off the machine. I couldn't ssh in to sync the disk.

When this happened yesterday it caused such horrendous corruption to my /, /var and /usr partitions that I eventually gave up on repairing the damage and just reformatted and reinstalled from scratch. Running fsck -y found and fixed hundreds, if not thousands of errors. Yes, my friends, fsck fixed my filesystems but good.

The problem I found once I could run fsck without complaint was that the data content of a lot of files was just plain wrong. For example, when I tried to resume the update (having already downloaded all the files), logrotate wouldn't reinstall because dpkg claimed that the file /usr/sbin/logrotate was part of package mailx.

What I eventually found was that the file /var/lib/dpkg/info/mailx.list didn't list the files that belonged to mailx anymore - it clearly contained logrotate's list! There were several files like this. I found a couple of files in /etc that erroneously contained some manner of typographical information instead of configuration data.
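
One rough way to look for this kind of damage is to list regular files that appear in more than one package's .list file. This is only a sketch: the function name duplicate_owners is made up, it assumes the lists live in /var/lib/dpkg/info as described above, and a hit is not proof of corruption, since some setups can legitimately list the same file twice.

```shell
#!/bin/sh
# Sketch only: duplicate_owners is a hypothetical helper.
# It reports regular files named in more than one dpkg file list.

duplicate_owners() {
    list_dir=${1:-/var/lib/dpkg/info}
    # Merge every list, then keep only the paths that occur more than once.
    cat "$list_dir"/*.list | sort | uniq -d |
    while IFS= read -r path; do
        # Directories are shared by many packages, so skip them
        # and report regular files only.
        if [ -f "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

It only catches a list that has been replaced by another package's list; it cannot detect files whose contents are wrong.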

At first I tried to fix the mess manually but after a while realized that I wouldn't be likely to find all the files with bad contents and just wiped and reinstalled.

Now here's my suggestion:

When dpkg completes installing one package it should sync all the filesystems. Also sync each time a file finishes downloading during the package downloads. That's it.
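
Until dpkg does this itself, the per-package half of the idea can be approximated from the outside. The following is a minimal sketch, not a change to dpkg: the function name install_each_with_sync and the INSTALL_CMD override are made up for this example, and installing packages one at a time leaves dependency ordering up to whoever calls it.

```shell
#!/bin/sh
# Sketch only: install .deb files one at a time, syncing after each.
# INSTALL_CMD is a hypothetical override (e.g. "echo") for dry runs;
# it defaults to "dpkg -i".

install_each_with_sync() {
    for deb in "$@"; do
        # Stop at the first package that fails to install.
        ${INSTALL_CMD:-dpkg -i} "$deb" || return 1
        # Commit this package's files to disk before starting the next.
        sync
    done
}
```

For example, install_each_with_sync /var/cache/apt/archives/*.deb would sync once per downloaded package, with the ordering caveat above.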

Frequently syncing the filesystems slows a machine down because you don't get the benefit of the buffer cache. But fault-tolerance during an upgrade, when you are making drastic changes to critical files, is of much more importance than performance.

A simple way to approach this before dpkg gets modified (or if the Debian developers choose not to accept my suggestion) would be to run a script like the following in the background just before starting up dselect to do an install or upgrade. Leave it running until the installation is complete:

#!/bin/sh
# /usr/local/bin/keep-syncing
# Flush the buffer cache to disk every 3 seconds until killed.

while true
do
    sync
    sleep 3
done
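
To avoid forgetting to stop the loop afterwards, the same idea can be wrapped around a single command. This is a minimal sketch under the same assumptions as the script above; the function name run_with_sync is made up for this example.

```shell
#!/bin/sh
# Sketch only: run_with_sync is a hypothetical wrapper around the
# keep-syncing loop. It syncs every 3 seconds while a command runs.

run_with_sync() {
    # Start the sync loop in the background and remember its PID.
    ( while true; do sync; sleep 3; done ) &
    sync_pid=$!

    # Run the real command, remembering its exit status.
    status=0
    "$@" || status=$?

    # Stop the loop, then do one last sync for the final writes.
    kill "$sync_pid" 2>/dev/null || true
    wait "$sync_pid" 2>/dev/null || true
    sync
    return "$status"
}
```

For example, run_with_sync dselect would keep syncing for as long as dselect runs and then return dselect's exit status.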

Well, I'm running kernel 2.4.18-5 now. Hopefully that won't crash on me anymore. But I'm going to use my keep-syncing script during any future upgrades.

Mike
--
Michael D. Crawford
GoingWare Inc. - Expert Software Development and Consulting
http://www.goingware.com
crawford@goingware.com

  Tilting at Windmills for a Better Tomorrow.

    "I give you this one rule of conduct. Do what you will, but speak
     out always. Be shunned, be hated, be ridiculed, be scared,
     be in doubt, but don't be gagged."
     -- John J. Chapman, "Make a Bonfire of Your Reputations"
        http://www.goingware.com/reputation/


