
Re: The danger of dishonest disk drives (WAS:Re: Need to remove a ghost file, but can't because it doesn't exist)



On Thu, 2006-11-23 at 17:31 -0500, Douglas Tutty wrote:

> 
> The question is how does the file system know that a write has made it
> to disk.  E.g. if the file system is atomic transaction oriented, how
> can the file system know that a commit has been committed if the drive
> lies?
> 

It's hard to know for sure, especially if the server is under abnormal
load, the inodes are 100% in use, and all that's left is dirty paging.
That seems to be where the problem happens most often.

I've been following this thread and thought I'd do a bit of
experimenting to see which of the two recovers best.

Here's my worst-case scenario (and test bed):

Debian Sarge under Xen, one 40 GB LVM-backed partition (jfs)  (#1)
Debian Sarge under Xen, one 40 GB LVM-backed partition (ext3) (#2)

The two LVM-backed VBDs live on separate 400 GB SATA drives, on a
standard onboard SATA controller (4 ports, no RAID).

Both systems have a small 512 MB ext2 root FS as a control. The 40 GB
partition was mounted at /datahell.

Both systems have 2 GB RAM and two vCPUs (the test was conducted on a
dual Opteron); test machine 1 has vcpu0 on core 0 and vcpu1 on core 1,
test machine 2 has vcpu0 on core 1 and vcpu1 on core 0.

So now we have, for all intents and purposes, two machines, each with a
single dual-core Opteron.
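
For reference, the data partitions were set up roughly like this (a
sketch only; the volume group, LV, and guest device names below are
stand-ins, not copied from my shell history):

    # dom0: one LV per guest, each on its own 400 GB SATA drive
    lvcreate -L 40G -n datahell1 vg_sata1   # backs test machine #1 (jfs)
    lvcreate -L 40G -n datahell2 vg_sata2   # backs test machine #2 (ext3)

    # inside guest #1
    mkfs.jfs -q /dev/hda2                   # -q skips the confirmation prompt
    mount /dev/hda2 /datahell

    # inside guest #2
    mkfs.ext3 /dev/hda2
    mount /dev/hda2 /datahell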

Here was the test:

Untar about 12 GB worth of files on both drives. The files consist of
some old backup CDs and shareware CDs: just thousands and thousands of
files.
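
Nothing fancy about the untar step; something along these lines in each
guest (the archive path is made up):

    # unpack every archive from the CD dumps into the test partition
    cd /datahell
    for t in /root/cd-dumps/*.tar.gz; do
        tar -xzf "$t"
    done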

I then ran a shell script that caused 'updatedb' to fork a few hundred
times in the background on each server; it kept forking
until /proc/loadavg reached about 70.0.
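
The script itself was throwaway; a minimal reconstruction (not the
original, and the threshold check is deliberately simplistic) looks
like:

    #!/bin/sh
    # keep forking updatedb into the background until the
    # 1-minute load average in /proc/loadavg passes 70
    while :; do
        load=$(cut -d' ' -f1 /proc/loadavg)
        [ "${load%%.*}" -ge 70 ] && break   # compare the integer part only
        updatedb &
    done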

Once that happened, I paused both VMs, issued a sysrq to sync disks, and
destroyed them in memory. This simulated an out-of-control box where the
admin was able to effect a shutdown with disks synced (rather than just
pushing reset).
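
From the dom0 that sequence was roughly the following, per guest (Xen
3.x 'xm' syntax; the domain name is a placeholder):

    xm pause testjfs        # freeze the runaway guest
    xm sysrq testjfs s      # SysRq 's': emergency sync of mounted filesystems
    xm destroy testjfs      # drop the domain cold, no clean shutdown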

Booted them up again:

Ext3 spent 30 minutes in fsck; some data was lost.

JFS spent 5 minutes; no data was lost.

The ext2 root FS didn't have any issues, but nothing was being written
to it during the experiment.

Experiment #2

Fresh 20 GB partitions, just like before:

Same experiment, only this time I didn't sync disks. I just destroyed
the VMs in memory (same as pulling the power plug) and rebooted.
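
This time the only command per guest was the kill itself (again, the
domain name is a placeholder):

    xm destroy testjfs      # no pause, no sysrq: the software equivalent
                            # of yanking the power cord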

Ext3 fixed a couple of inodes and came back pretty quickly.
The JFS drive couldn't be mounted.

Again, the ext2 root FS had no issues, but we weren't expecting any; it
was used just as a control (and to boot). /var was moved to the second
drive (where slocate's DB lives).

The end result is, it's going to depend on how the file system allocates
inodes ahead of itself, and at what point your system runs out of clean
pages to grab. JFS seems to do well *only* if you're able to sync disks
so it can write those inodes; it leaves quite a bit of data in memory.
However, it's much happier about flushing its inode cache and syncing
even if all that's available is dirty paging.

Ext3 seems more likely to recover from its journal in the event you
can't sync disks, but syncing it with maxed/bloated inodes (reaching
into dirty pages) seems to break it.

It's really application-specific, I guess. If you have the luxury of
being able to anticipate what the world will do to your public services
once you plug the Internet into a server, the choice is a little
easier... but there is no magic bullet :)

Ext3 seems more likely to come back to life after an unattended crash
(where nobody was there to try and slow down the skid).

JFS seems like the winner if your system doesn't often get abused, or if
you have the ability to monitor it closely and intervene should you see
dirty paging (swap) and inodes running high. Note that because JFS seems
to use much more memory to allocate its inodes, this may lead to your
applications needing swap sooner than they would with ext3.
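
If you do want to watch for that condition, a few quick probes (one way
of many; the thresholds are yours to pick):

    awk '/^Dirty:/' /proc/meminfo       # dirty pages waiting to reach disk
    cat /proc/sys/fs/inode-nr           # allocated vs. free in-core inodes
    vmstat 5                            # si/so columns show swap traffic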

Six of one, half a dozen of the other, really... but hopefully my little
experiment helps someone decide which one is best to use :) I had a few
systems set up for an ocfs2 stress test and figured I'd take advantage
of them for this.

I was in no way measuring I/O performance, just how well the file
systems came back to life after bad things happened.

Best,
-Tim




> Doug.
> 
> 


