Re: ..fixing ext3 fs going read-only, was : Sendmail or Qmail ? ..
On Thu, Sep 11, 2003 at 02:04:19AM +0200, Arnt Karlsen wrote:
> ..I still believe in raid-1, but, ext3fs???
> ..how does xfs, jfs and Reiserfs compare?
If you have random disk corruptions happening as often as you are, no
filesystem is going to be able to help you. The only question is how
quickly the filesystem notices *before* user data starts getting
irrecoverably lost. Ext3 generally tends to be one of the more paranoid
filesystems about checking assertions and "should never happen" cases,
although I don't know how it compares to reiserfs, jfs, et al.
There have certainly been cases in the past where people were
convinced that there was a bug in ext2, since other filesystems (minix
in this particular case) weren't reporting the problem. But, it
turned out to be a buffer cache bug, and it was simply that other
filesystems were not doing the appropriate assertion checks, and user
data was getting lost; the system administrator was just left in
> > Unless you're talking about *software* RAID-1 under Linux, and the
> ..bingo, I should have said so.
> > fact that you have to rebuild mirror after an unclean shutdown, but
> > that's arguably a defect in the software RAID 1 implementation. On
> > other systems, such as AIX's software RAID-1, the RAID-1 is
> > implemented with a journal,
> ..but software RAID-1 under Linux is not or did I miss something here?
No, software RAID-1 does not do journalling at the RAID level. That
means that in the case of an unclean shutdown, the RAID system will
need to reestablish the mirror. As I said, this is a performance
issue, since half the disk bandwidth of the RAID array will be
diverted to reestablishing the mirror after the unclean shutdown.
Note also this is true *regardless* of what filesystem you use,
journaling and non-journaling.
> ..ok, for my throttle boxes, here is where I should honk the
> horn and divert logging to a log server and schedule a fsck?
> (And ofcourse just reboot my mailservers on the same error.)
For your throttle boxes, do you need to have any writes to your
filesystems at all? If what you care about is zero downtime, why not
just run syslog over the network, and keep all of your filesystems
mounted read-only? Some extreme configurations I've seen (especially
where ISPs don't have direct/easy access to their systems at remote
POPs) use a read-only flash filesystem, and a ramdisk for /tmp, and
no spinning disks at all. This significantly reduces the downtime
caused by disk failures, since the hard drive is often the most
vulnerable part of the system, especially in the face of heat
> ..IMHO the debian bootstrap should first read the rpm database
> and generate a deb database, and then do 'apt-get update && \
> apt-get dist-upgrade'. _Is_ there such a bootstrap beast?
While this would be interesting for those people who are converting
from Red Hat to Debian, it's a lot more complicated than that, since
you also have to convert over the configuration files; Red Hat and
Debian don't necessarily store files in the same location.
I generally find that for production systems, it's much safer and
simpler to install Debian on a new disk (and on a new system), and
then copy the new configuration files over. That way, you can
test the system and make sure everything is A-OK before cutting over
something on a production system.
(By the way, it seems like 50% of your problems come from doing
things on the cheap while still wanting 100% reliability. If you
want "carrier-grade reliability", you need to pay a little bit extra,
and do things like have hot spares, and installation scripts that
allow you to create and configure new servers automatically, without
needing manual handwork.)
> ..256MB, but the disks may be marginal, on the known bad disks I get
> write errors. I have seen this same error on power "blinks", failures
> lasting for about a 1/3 of a second without losing monitor sync etc
> on my desktops, once frying a power supply, but usually these "blinks"
> cause no harm.
Sounds like you have marginal power. Do you have a UPS (preferably a
continuous UPS) to protect your systems? If not, why not? (Again,
it's a bad idea to expect "carrier-grade reliability" when you're not
willing to pay for the basic high-quality equipment, backup equipment,
and devices such as UPSes to protect your equipment.)
> ..ah. So with a 30GB /var ext3fs raid-1 I would have 25% or 13%
> consumed by backup copies of the superblock and block group descriptors?
It's an order n**2 problem; so it's not a linear relationship. And
most people get annoyed by that kind of overhead, long before it gets
to 10% or above.
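
To make the n**2 point concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes every block group carries a full copy of
the superblock and the group descriptor table (i.e. no sparse_super),
and a 32-byte group descriptor. The function name and the example
sizes are mine, purely for illustration; they are not the layout of
any particular filesystem discussed in this thread.

```python
def backup_overhead_blocks(total_blocks, blocks_per_group, block_size,
                           desc_size=32):
    """Estimate blocks used by superblock + group descriptor copies.

    Assumes (for illustration) that EVERY block group holds one
    superblock copy plus a full copy of the group descriptor table.
    The count includes the primary copies in group 0.
    """
    # Number of block groups, rounded up.
    groups = -(-total_blocks // blocks_per_group)
    # Blocks needed to hold one descriptor per group, rounded up.
    gdt_blocks = -(-(groups * desc_size) // block_size)
    # One superblock block plus the descriptor table, in every group.
    return groups * (1 + gdt_blocks)


if __name__ == "__main__":
    # Hypothetical 30GB and 60GB filesystems, 1K blocks, 8192 blocks/group.
    for gigabytes in (30, 60):
        total = gigabytes * 1024 * 1024  # number of 1K blocks
        used = backup_overhead_blocks(total, 8192, 1024)
        print(gigabytes, "GB:", used, "blocks,",
              round(100.0 * used / total, 2), "% of the filesystem")
```

The descriptor table grows with the number of groups, and the number
of copies also grows with the number of groups, which is where the
n**2 comes from: doubling the filesystem size roughly quadruples the
space spent on copies.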
> ..how does the journalling system choose which blocks to work from?
> What I've been able to see, the journal dies when their super blocks
> go bad?
The filesystem needs the superblock in order to find the journal. If
you have a single gigantic filesystem mounted on /, then if the
primary superblock is corrupted, the kernel will not be able to mount
/, and you're hosed. E2fsck will automatically try the primary
superblock, and if that is corrupt, it will try the first backup
superblock. If that is corrupted as well, a human will need to
manually try one of the other backup superblocks.
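
If you need to work out where the other backup superblocks live, the
following Python sketch may help. It assumes the filesystem was made
with the sparse_super feature, where backups are kept in block group 1
and in groups that are powers of 3, 5, and 7. The function names are
mine, and the group count, blocks per group, and first data block are
values you have to supply for your own filesystem; treat the result as
a list of candidates to try, not as gospel.

```python
def is_power_of(n, base):
    """Return True if n equals base**k for some integer k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblock_groups(group_count):
    """Block groups (other than group 0) that hold a backup superblock,
    assuming the sparse_super layout: group 1 and powers of 3, 5, 7."""
    return [g for g in range(1, group_count)
            if any(is_power_of(g, b) for b in (3, 5, 7))]


def backup_superblock_blocks(group_count, blocks_per_group,
                             first_data_block):
    """Block numbers of the backup superblocks, one per backup group."""
    return [first_data_block + g * blocks_per_group
            for g in backup_superblock_groups(group_count)]


if __name__ == "__main__":
    # Hypothetical filesystem: 4K blocks, 32768 blocks per group,
    # 240 block groups, first data block 0.
    for block in backup_superblock_blocks(240, 32768, 0):
        print(block)
```

Each number printed is a candidate to hand to e2fsck's -b option. If
the filesystem uses 1K blocks, the first data block is 1 rather than
0, so the candidates shift by one (8193 instead of 8192, and so on).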
If your primary superblock is getting corrupted often, then first of
all, you should try to figure out why this is happening, and take
steps to prevent it. (The fact that you're reporting
marginal power is supremely suspicious; marginal power can cause disk
corruptions very easily. Getting higher quality power supplies will
help, but a UPS is the first thing I would get.)
Secondly, you're better off using a small root filesystem that
generally isn't modified often. What I normally do is use a 128 meg
root filesystem, with a separate /var partition (or /var symlinked to
/usr/var), and /tmp as a ram disk. With the root filesystem rarely
changing, it's much less likely that it will be corrupted due to
hardware problems. Then the root filesystem can come up, and e2fsck
can repair the other filesystems.
But I repeat, your filesystems shouldn't be getting corrupted in the
first place. Using a separate root filesystem is a good idea, and
will help you recover from hardware problems, but your first
priority should be to avoid the hardware problems in the first place.