Re: ..fixing ext3 fs going read-only, was : Sendmail or Qmail ? ..
On Wed, 10 Sep 2003 14:39:44 -0400,
Theodore Ts'o <email@example.com> wrote in message
> On Wed, Sep 10, 2003 at 01:36:32AM +0200, Arnt Karlsen wrote:
> > > But for an unattended server, most of the time it's probably
> > > better to force the system to reboot so you can restore service
> > > ASAP.
> > ..even for raid-1 disks??? _Is_ there a combination of raid-1 and
> > journalling fs'es for linux that's ready for carrier grade service?
> I'm not sure what you're referring to here.
..isp gateway boxes that I like to keep running 24/7/365.2442etc.
The idea behind using ext3fs and raid-1 was to minimize the
risks and downtime.
..I still believe in raid-1, but, ext3fs???
..how do xfs, jfs and Reiserfs compare?
..what I like with ext3 is it can be mounted as ext2
in a pinch, assuming you can get to /etc/fstab. ;-)
> As far as I'm concerned, if the filesystem is inconsistent, panic'ing
> and letting the system get back to a known state is always the right
> answer. RAID-1 shouldn't be an issue here.
..shouldn't be a problem, agreed.
> Unless you're talking about *software* RAID-1 under Linux, and the
..bingo, I should have said so.
> fact that you have to rebuild mirror after an unclean shutdown, but
> that's arguably a defect in the software RAID 1 implementation. On
> other systems, such as AIX's software RAID-1, the RAID-1 is
> implemented with a journal,
..but software RAID-1 under Linux is not, or did I miss something here?
> so that there is no need to rebuild the
> mirror after an unclean shutdown. Alternatively, you could use a
> hardware RAID-1 solution, which also wouldn't have a problem with
> unclean shutdowns.
> In any case, the speed hit for doing a panic with the current Linux
> MD implementation is a performance issue, and in my book reliability
> takes precedence over performance. So yes, even for RAID-1, and it
> doesn't matter what filesystem, if there's a problem, you should
> reboot. If you don't like the resulting performance hit after the
> panic, get a hardware RAID controller.
..agreed, and disagreed; for my isp gateway throttles, reboots mean
isp service downtime. Logs can be tee'ed to log servers. For a mail
server I agree fully.
> > > I'm not sure what you mean by this. When there is a filesystem
> > > error
> > ..add a "healthy" dose of irony to repair in "repair". ;-)
> > > detected, all writes to the filesystem are immediately aborted,
> > > which
> > ...precludes reporting the error?
> No, if you are using a networked syslog daemon, it certainly does not
> preclude reporting the error. If you mean the case where there is a
> filesystem error on the partition where /var/log resides, yes, we
> consider it better to abort writes to the filesystem than to attempt
> to write out the log message to a compromised filesystem.
..ok, for my throttle boxes, here is where I should honk the
horn and divert logging to a log server and schedule a fsck?
(And of course just reboot my mailservers on the same error.)
..bottom line is that same journal death needs different
medication depending on which purpose etc the box serves.
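..a minimal sketch of the "honk the horn" check, assuming the box exposes /proc/mounts and that any ext3 filesystem showing the ro option deserves a look (a filesystem deliberately mounted read-only would be flagged too). The function name is made up for illustration; what to do with the result (mail, pager, log server) is left open:

```python
def read_only_ext3_mounts(mounts_text):
    """Return mount points of ext3 filesystems that are mounted read-only.

    mounts_text is the content of /proc/mounts, where each line reads:
    device mount_point fs_type options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Compare whole options, so "errors=remount-ro" does not match "ro".
        if fs_type == "ext3" and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


# Possible use from a cron job:
#     with open("/proc/mounts") as f:
#         for mount_point in read_only_ext3_mounts(f.read()):
#             print("WARNING: %s is mounted read-only" % mount_point)
```

..run from cron every few minutes, this would catch a remount that nobody was logged in to see.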
> > .._exactly_, but it is not reported to any of the system users.
> > A system reboot _is_ reported usefully to the system users, all
> > tty users get the news.
> The message that a filesystem has been remounted read-only is logged
> as a KERN_CRIT message. If you wish, you can configure your
> syslog.conf so that all tty users are notified of kern.crit level
..doh! I _like_ fixes this simple. ;-)
> errors. That's probably a good thing, although it's not clear that a
> typical user will understand what to do when they are told that a
> filesystem has been remounted read-only.
..so clue whack'em. On a desktop they are not gonna lose much more
than 5 seconds worth of work with the default commits, and scaring
them with 30 years research work loss is a good way to slap'em into
doing the right things.
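..for reference, a sketch of what the syslog.conf change suggested above might look like, assuming classic sysklogd syntax (selector and action separated by tabs, an action of * meaning every logged-in user, @host meaning forward to a remote syslogd). "loghost" is a placeholder name, not a real machine:

```
# /etc/syslog.conf
# Kernel messages of priority crit and above go to every logged-in user.
kern.crit					*
# The same messages are also forwarded to a remote log server.
kern.crit					@loghost
```

..syslogd has to be told to reread its configuration (for example with a SIGHUP) before the new lines take effect.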
> Certainly it is trivial to configure sysklogd to grab that message and
> do whatever you would like with it, if you were to so choose. If you
> want to "honk the big horn", that is certainly within your power to
> make the system do that.
> If you believe that Red Hat should configure their syslog.conf files
> to do this by default, feel free to submit a bug report / suggestion
> with Red Hat.
..heh, the last time I tried that, was:
..the "This is a duplicate of various other bugs. We're looking at the
issues involved." is what brought me over to Debian.
..now, if installing and raid disks could be as easy...
..IMHO the debian bootstrap should first read the rpm database
and generate a deb database, and then do 'apt-get update && \
apt-get dist-upgrade'. _Is_ there such a bootstrap beast?
> > > of uncommitted data which has not been written out to disk.) So
> > > in general, not running the journal will leave you in a worse
> > > state after rebooting, compared to running the journal.
> > ..it appears my experience disagrees with your expertise here.
> > With more data, I would have been able to advise intelligently
> > on when to and when not to run the journal, I believe we agree
> > not running the journal is advisable if the system has been
> > left limping like this for a few hours.
> How long the system has been left limping doesn't really matter. The
> real issue is that there may be critical data that has been written to
> the journal that was not written to the filesystem before the journal
> was aborted and the filesystem left in a read-only state. This might,
> for example, include a user's thesis or several years of research.
> (Why such work might not be backed up is a question I will leave for
> another day, and falls into the "criminally negligent system
> administrator" category....)
..why I said _No way!_ on a tape back-up box job half a year
ago, after spending 2 weeks on it. ;-)
> In general, you're better off running the journal after a journal
> abort. You may think you have experiences to the contrary, but
> are you sure? Unless you snapshot the entire filesystem, and try it
> both ways, you can't really know for sure. There are classes of
> errors where the filesystem has been completely trashed, and whether
> or not you run the journal won't make a bit of difference.
..ok, the last 2 boxes pancaking on this were installed by my
isp client, with _everything_ but /boot under "LABEL=/". Yeeha.
Leaving only the swap for chroot installs that I like to
try out in my lab first.
..is there no way to force remount,rw so the damned thing can be
remounted on ext2 fs'es? (Other than drop ext3 support and replace
the kernel when /boot is writeable, not always the case.)
> The much more important question is to figure out why the filesystem
> got trashed in the first place. Do you have marginal memory? hard
..256MB, but the disks may be marginal, on the known bad disks I get
write errors. I have seen this same error on power "blinks", failures
lasting for about 1/3 of a second without losing monitor sync etc
on my desktops, once frying a power supply, but usually these "blinks"
cause no harm.
> Are you running a beta-test kernel that might be buggy?
..possibly, Red Hat's own SRPM's with the Patch-o-Matic iptables and
in one case also the MPPE tunnel modules, on my remaining RH boxes.
..on my debian boxes I stick to stable and 2.4, it has what I need.
> Fixing the proximate cause is always the most important thing to do;
> since in the end, no matter how clever a filesystem, if you have buggy
> hardware or buggy device drivers, in the end you *will* be screwed. A
> filesystem can't compensate for those sorts of shortcomings.
..agreed, but it should try to help and not fight the diagnosing.
> > ..and, on a raid-1 disk set, a failure oughtta cut off the one bad
> > fs and not shoot down the entire raid set because that one fs fails.
> I agree. When is that not happening?
..I/we need test data here. Come to think of it, all my ext3
journal failures have been on stand alone disks.
> > ..sparse_super is IMNTHOAIME _not_ worth the saved disk space,
> > and should _not_ be the default setup option.
> Interesting assertion. I disagree; if you'd like to back up this
> assertion with some arguments, I'll be happy to discuss it.
> I will note that a 128 meg filesystem still has half a dozen backup
> superblocks, which should be more than enough to recover from a disk
> error. For truly large filesystems without sparse_super, the disk
> space consumed is order O(n**2), which means that for a filesystem
> which is 64GB and using 1k blocks (say because it is used for storing
> Usenet articles), 50% of the space --- 32GB out of the 64GB --- will
> be consumed by backup copies of the superblock and block group
> descriptors. There is a very good reason why sparse_super is turned
> on by default.
..ah. So with a 30GB /var ext3fs raid-1 I would have 25% or 13%
consumed by backup copies of the superblock and block group descriptors?
..12% is ok, 25% is, hummm, a tad high, but survivable, leaves about
20GB useful space.
..how does the journalling system choose which blocks to work from?
What I've been able to see, the journal dies when their super blocks
> > ..180 days is IMNTHOAIME _much_ too long between fsck's. Reboots
> > defeat the point with /usr/bin/uptime and cause downtime, too.
> This is configurable, and ultimately is up to each individual system
> administrator. Many people complain bitterly about the forced fsck
..agreed, and for very good reason when the fsck means a DoS.
_That_ is why I like to schedule them.
> I will note that much depends on your hardware. If you have quality
> hardware, and you're running a good, stable, well-tested production
> kernel, in practice fsck should never turn up any errors, and how
> often you run it is simply a function of how paranoid you're feeling.
> You should not be depending on fsck to find any problems. If you are,
> then there's something desperately wrong with your system, and you
> should find and fix the problems.
..ok, on my desktop I could not remount /var rw as ext2 after I had
the ext3 journal die last night, on reboot fsck could not fix the fs,
but it now works _nicely_ rw as ext2. Ok, this is _not_ acceptable
for my isp gateways. ;-)
> If you are using EVMS or some other system where you can take
> read-only snapshots, something which you *can* do is to periodically
> (Tuesday morning at 3am for example), have a cron script take a
> read-only snapshot, and then run e2fsck on the read-only snapshot and
> then discard the snapshot. If the e2fsck returns no errors, you can
> use tune2fs to set the last checked time on the mounted filesystem.
..good idea, I'll check this out.
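..a rough sketch of how such a cron job could be wired up, assuming LVM snapshots rather than EVMS (the suggestion above allows "some other system") and assuming lvcreate, e2fsck, tune2fs and lvremove are installed. The function names, the snapshot name and the 1G snapshot size are all placeholders for illustration, and whether a snapshot of a mounted ext3 filesystem checks clean depends on the snapshot being taken consistently:

```python
import subprocess


def snapshot_check_commands(vg, lv, snap="fsck_snap", snap_size="1G"):
    """Build the command lines for one snapshot-and-check cycle."""
    origin = "/dev/%s/%s" % (vg, lv)
    snapshot = "/dev/%s/%s" % (vg, snap)
    return {
        # Take a snapshot of the mounted volume.
        "create": ["lvcreate", "-s", "-L", snap_size, "-n", snap, origin],
        # Force a full check, opening the snapshot read-only (-n).
        "check": ["e2fsck", "-f", "-n", snapshot],
        # Record "last checked: now" on the real filesystem.
        "mark_checked": ["tune2fs", "-T", "now", origin],
        # Throw the snapshot away again.
        "discard": ["lvremove", "-f", snapshot],
    }


def run_snapshot_check(vg, lv, run=subprocess.call):
    """Return True if the snapshot checked clean and the origin was marked."""
    cmds = snapshot_check_commands(vg, lv)
    if run(cmds["create"]) != 0:
        return False
    try:
        clean = run(cmds["check"]) == 0  # e2fsck exits 0 when no errors found
    finally:
        run(cmds["discard"])  # always discard the snapshot
    if clean:
        run(cmds["mark_checked"])
    return clean
```

..a crontab entry calling run_snapshot_check("vg0", "var") at 3am on Tuesdays would match the schedule suggested above; "vg0" and "var" are made-up volume names. If e2fsck reports anything, the last-checked time is left alone so the regular boot-time check still triggers.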
..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
Scenarios always come in sets of three:
best case, worst case, and just in case.