Re: ..fixing ext3 fs going read-only, was : Sendmail or Qmail ? ..
On Thu, 11 Sep 2003 14:03:17 -0400,
Theodore Ts'o <email@example.com> wrote in message
> On Thu, Sep 11, 2003 at 02:04:19AM +0200, Arnt Karlsen wrote:
> > ..I still believe in raid-1, but, ext3fs???
> > ..how does xfs, jfs and Reiserfs compare?
> If you have random disk corruptions happening as often as you are, no
> filesystem is going to be able to help you. The only question is how
> quickly the filesystem notices *before* user data starts getting
> irrecoverably lost. Ext3 generally tends to be one of the more paranoid
> filesystems about checking assertions and "should never happen" cases,
> although I don't know how it compares to reiserfs, jfs, et al.
..ok, how about ext3 versus ext2 on raid-1?
> > > Unless you're talking about *software* RAID-1 under Linux, and the
> > ..bingo, I should have said so.
> > > fact that you have to rebuild the mirror after an unclean shutdown,
> > > but that's arguably a defect in the software RAID 1
> > > implementation. On other systems, such as AIX's software RAID-1,
> > > the RAID-1 is implemented with a journal,
> > ..but software RAID-1 under Linux is not or did I miss something
> > here?
> No, software RAID-1 does not do journalling at the RAID level. That
> means that in the case of an unclean shutdown, the RAID system will
> need to re-establish the mirror.
..and after a journal death, and fsck, the raid set will be able
to re-establish itself, no? Or does the journal do both/all disks
in a raid set?
> As I said, this is a performance issue, since half the disk bandwidth
> of the RAID array will be diverted to re-establishing the mirror after
> the unclean shutdown. Note also this is true *regardless* of what
> filesystem you use, journaling or non-journaling.
..noted, non-issue in my case.
> > ..ok, for my throttle boxes, here is where I should honk the
> > horn and divert logging to a log server and schedule a fsck?
> > (And of course just reboot my mailservers on the same error.)
> For your throttle boxes, do you need to have any writes to your
> filesystems at all? If what you care about is zero downtime, why not
> just run syslog over the network, and keep all of your filesystems
> mounted read-only? Some extreme configurations I've seen (especially
> where ISPs don't have direct/easy access to their systems at remote
> POPs), use a read-only flash filesystem, and a ramdisk for /tmp, and
> no spinning disks at all. This significantly improves reliability,
> since the hard drive is often the most vulnerable part of the
> system, especially in the face of heat, vibration, etc.
..sounds like an idea. The major point against is geography,
I like to arrive at stand-alone one-box solutions, but networked
logging is a good way to verify the network status. What is
used, ssh tunnels?
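
A minimal sketch of the "honk the horn and log over the network" idea
discussed above, assuming Python is available on the box. It checks the
contents of /proc/mounts for ext2/ext3 filesystems that have gone
read-only and sends a warning to a remote syslog server over plain UDP.
The function names, the logger name "fswatch" and the host name used
below are hypothetical, picked only for illustration; this is not what
either poster actually runs.

```python
import logging
import logging.handlers


def read_only_mounts(mounts_text, fstypes=("ext2", "ext3")):
    """Return mount points of the given filesystem types mounted read-only.

    mounts_text is the content of /proc/mounts: one mount per line with
    whitespace-separated fields (device, mount point, type, options, ...).
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


def make_remote_logger(host, port=514, name="fswatch"):
    """Build a logger that sends records to a syslog server over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep records out of the root logger
    logger.addHandler(logging.handlers.SysLogHandler(address=(host, port)))
    return logger
```

From cron, one would read /proc/mounts, pass the text to
read_only_mounts(), and if the result is non-empty call
make_remote_logger("loghost.example.com").warning(...) with the list
(the host name is a placeholder). Plain syslog over UDP is neither
encrypted nor authenticated, so whether it needs a tunnel depends on
the network between the boxes.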
> > ..IMHO the debian bootstrap should first read the rpm database
> > and generate a deb database, and then do 'apt-get update && \
> > apt-get dist-upgrade'. _Is_ there such a bootstrap beast?
> While this would be interesting for those people who are converting
> from Red Hat to Debian, it's a lot more complicated than that, since
> you also have to convert over the configuration files; Red Hat and
> Debian don't necessarily store files in the same location.
..I know. ;-)
> I generally find that for production systems, it's much safer and
> simpler to install Debian on a new disk (and on a new system), and
> then copy the configuration files over. That way, you can
> test the system and make sure everything is A-OK before cutting over
> something on a production system.
..yeah, my pipe dream. ;-)
> (By the way, it seems like 50% of your problems is that you're doing
> things on the cheap, and yet you still want 100% reliability. If you
> want "carrier-grade reliability", you need to pay a little bit extra,
> and do things like have hot spares, and installation scripts that
> allow you to create and configure new servers automatically, without
> needing manual handwork.)
..hey, the isp shop is not mine, and it _is_ a small operation,
so I need to grow it so I can charge'em. ;-) These guys are
Wintendo convertites, and I do the hard stuff for 'em. ;-)
> > ..256MB, but the disks may be marginal, on the known bad disks I get
> > write errors. I have seen this same error on power "blinks",
> > failures lasting for about a 1/3 of a second without losing monitor
> > sync etc on my desktops, once frying a power supply, but usually
> > these "blinks" cause no harm.
> Sounds like you have marginal power. Do you have a UPS (preferably a
> continuous UPS) to protect your systems? If not, why not? (Again,
> it's a bad idea to expect "carrier-grade reliability" when you're not
> willing to pay for the basic high-quality equipment, backup equipment,
> and devices such as UPSes to protect your equipment.)
..2 different sites, I have marginal power in my lab, but the
isp gear is on ups, and that again is on a priority grid feed.
..will be producing my own power on this; geek code suggestions?:
> > ..ah. So with a 30GB /var ext3fs raid-1 I would have 25% or 13%
> > consumed by backup copies of the superblock and block group
> > descriptors?
> It's an order n**2 problem; so it's not a linear relationship. And
> most people get annoyed by that kind of overhead, long before it gets
> to 10% or above.
..so I'm tolerant. ;-)
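
To put rough numbers on the "order n**2" remark, here is a
back-of-the-envelope sketch. It assumes the classic ext2/ext3 layout
with no sparse_super feature: every block group carries a copy of the
superblock plus the full group descriptor table, each descriptor is 32
bytes, and a group spans 8 * block_size blocks. The function name is
made up for illustration, and the result is an estimate, not what
mke2fs will report for a real filesystem.

```python
import math

GROUP_DESC_BYTES = 32  # size of one ext2/ext3 block group descriptor


def backup_overhead(fs_bytes, block_size):
    """Estimate the fraction of the filesystem taken by superblock and
    group descriptor copies when every block group holds a full copy."""
    # One bitmap block tracks 8 * block_size blocks, which sets the group size.
    blocks_per_group = 8 * block_size
    total_blocks = fs_bytes // block_size
    groups = math.ceil(total_blocks / blocks_per_group)
    # The descriptor table grows with the number of groups ...
    gdt_blocks = math.ceil(groups * GROUP_DESC_BYTES / block_size)
    # ... and is stored once per group, hence the quadratic growth.
    overhead_blocks = groups * (1 + gdt_blocks)
    return overhead_blocks / total_blocks


if __name__ == "__main__":
    gib = 2 ** 30
    for size in (30 * gib, 60 * gib):
        for bs in (1024, 4096):
            pct = 100 * backup_overhead(size, bs)
            print(f"{size // gib} GiB, {bs}-byte blocks: {pct:.3f}%")
```

Under these assumptions a 30 GiB filesystem with 1 KiB blocks loses
about 1.5% to the copies, and doubling the size roughly doubles that
percentage, since the absolute overhead roughly quadruples. With 4 KiB
blocks the figure is far smaller because there are far fewer groups.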
> > ..how does the journalling system choose which blocks to work from?
> > What I've been able to see, the journal dies when their super blocks
> > go bad?
> The filesystem needs the superblock in order to find the journal. If
> you have a single gigantic filesystem mounted on /, then if the
> primary superblock is corrupted, the kernel will not be able to mount
> /, and you're hosed. E2fsck will automatically try the primary
> superblock, and if that is corrupt, it will try the first backup
> superblock. If that is corrupted as well, a human will need to
> manually try one of the other backup superblocks.
..this can be tuned to try more blocks before whining for manpower?
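
For the manual step, the candidate backup locations can at least be
computed ahead of time. The sketch below assumes default mke2fs
geometry (8 * block_size blocks per group) and the sparse_super
feature, where backups live in group 1 and in groups that are powers
of 3, 5 and 7. The function names are hypothetical; treat the output
as a list of candidates to try, not as a guarantee for a filesystem
created with non-default options.

```python
def backup_superblock_groups(group_count):
    """Block groups holding a backup superblock under sparse_super:
    group 1 plus every power of 3, 5 and 7 below group_count."""
    groups = {1} if group_count > 1 else set()
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return sorted(groups)


def backup_superblock_blocks(total_blocks, block_size):
    """Block numbers of the backup superblocks, in ascending order."""
    blocks_per_group = 8 * block_size
    # With 1 KiB blocks, block 0 is the boot block and data starts at block 1.
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division to count partially filled trailing groups too.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)
    return [g * blocks_per_group + first_data_block
            for g in backup_superblock_groups(group_count)]


if __name__ == "__main__":
    # 30 GiB filesystem with 4 KiB blocks
    print(backup_superblock_blocks(30 * 2 ** 30 // 4096, 4096))
```

Each number can be handed to e2fsck with -b (plus -B for the block
size if needed). Running mke2fs -n on the device prints the locations
without writing anything, which is the safer cross-check when the
original mke2fs options are known.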
> If your primary superblock is getting corrupted often, then first of
> all, you should try to figure out why this is happening, and take
> steps to prevent it. (The fact that you're reporting
> marginal power is supremely suspicious; marginal power can cause disk
> corruptions very easily. Getting higher quality power supplies will
> help, but a UPS is the first thing I would get.)
..yeah, I'm working on the power bit. ;-)
> Secondly, you're better off using a small root filesystem that
> generally isn't modified often. What I normally do is use a 128 meg
> root filesystem, with a separate /var partition (or /var symlinked to
> /usr/var), and /tmp as a ram disk. With the root filesystem rarely
> changing, it's much less likely that it will be corrupted due to
> hardware problems. Then the root filesystem can come up, and e2fsck
> can repair the other filesystems.
..yeah, except for /tmp on ramdisk, that's how I do my boxes,
and my isp business client is learning his lesson good. ;-)
> But I repeat, your filesystems shouldn't be getting corrupted in the
> first place. Using a separate root filesystem is a good idea, and
> will help you recover from hardware problems, but your primary
> priority should be to avoid the hardware problems in the first place.
> - Ted
..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
Scenarios always come in sets of three:
best case, worst case, and just in case.