
Re: ..fixing ext3 fs going read-only, was : Sendmail or Qmail ? ..



On Wed, Sep 10, 2003 at 01:36:32AM +0200, Arnt Karlsen wrote:
> > But for an unattended server, most of the time it's probably better to
> > force the system to reboot so you can restore service ASAP.
> 
> ..even for raid-1 disks???  _Is_ there a combination of raid-1 and 
> journalling fs'es for linux that's ready for carrier grade service?

I'm not sure what you're referring to here.  As far as I'm concerned,
if the filesystem is inconsistent, panic'ing and letting the system
get back to a known state is always the right answer.  RAID-1
shouldn't be an issue here.  

Unless you're talking about *software* RAID-1 under Linux, and the
fact that you have to rebuild the mirror after an unclean shutdown,
but that's arguably a defect in the software RAID-1 implementation.
On other systems, such as AIX's software RAID-1, the RAID-1 is
implemented with a journal, so that there is no need to rebuild the
mirror after an unclean shutdown.  Alternatively, you could use a
hardware RAID-1 solution, which also wouldn't have a problem with
unclean shutdowns.

In any case, the speed hit from doing a panic with the current Linux
MD implementation is a performance issue, and in my book reliability
takes precedence over performance.  So yes, even for RAID-1, and it
doesn't matter what filesystem: if there's a problem, you should
reboot.  If you don't like the resulting performance hit after the
panic, get a hardware RAID controller.
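
For what it's worth, the "panic instead of remount read-only"
behaviour can be selected per filesystem.  A minimal sketch (the
device name and mount point below are just placeholders):

    # Make the kernel panic instead of remounting read-only when
    # ext2/ext3 detects an error on this filesystem:
    tune2fs -e panic /dev/md0

    # Or select it with the errors= mount option in /etc/fstab
    # (valid values are continue, remount-ro, and panic):
    # /dev/md0   /srv   ext3   defaults,errors=panic   0   2

    # To turn the panic into an automatic reboot, set a panic
    # timeout, e.g.:
    echo 30 > /proc/sys/kernel/panic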

> > I'm not sure what you mean by this.  When there is a filesystem error
> 
> ..add an "healthy" dose of irony to repair in "repair".  ;-)
> 
> > detected, all writes to the filesystem are immediately aborted, which
> 
> ...precludes reporting the error?  

No, if you are using a networked syslog daemon, it certainly does
not preclude reporting the error.  If you mean the case where there is a
filesystem error on the partition where /var/log resides, yes, we
consider it better to abort writes to the filesystem than to attempt
to write out the log message to a compromised filesystem.
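
As a hedged example, shipping kernel messages off the box with stock
sysklogd only takes a line in syslog.conf (the host name "loghost"
below is a placeholder), so the report survives even if the local
/var/log filesystem is the one that just went read-only:

    # /etc/syslog.conf on the affected machine: forward kernel
    # messages to a remote log host
    kern.*					@loghost

    # The log host's syslogd must be started with -r so that it
    # accepts messages from the network.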

> .._exactly_, but it is not reported to any of the system users.  
> A system reboot _is_ reported usefully to the system users, all 
> tty users get the news.

The message that a filesystem has been remounted read-only is logged
as a KERN_CRIT message.  If you wish, you can configure your
syslog.conf so that all tty users are notified of kern.crit level
errors.  That's probably a good thing, although it's not clear that a
typical user will understand what to do when they are told that a
filesystem has been remounted read-only.

Certainly it is trivial to configure sysklogd to grab that message and
do whatever you would like with it, if you were to so choose.  If you
want to "honk the big horn", that is certainly within your power to
make the system do that.
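
For example (assuming stock sysklogd; the exact selectors are a
matter of taste), the following syslog.conf lines send kern.crit and
worse to the console and to every logged-in user's terminal:

    kern.crit				/dev/console
    kern.crit				*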

If you believe that Red Hat should configure their syslog.conf files
to do this by default, feel free to submit a bug report / suggestion
with Red Hat.

> > of uncommitted data which has not been written out to disk.)  So in
> > general, not running the journal will leave you in a worse state after
> > rebooting, compared to running the journal.
> 
> ..it appears my experience disagrees with your expertise here.
> With more data, I would have been able to advise intelligently 
> on when to and when not to run the journal, I believe we agree 
> not running the journal is advisable if the system has been 
> left limping like this for a few hours.

How long the system has been left limping doesn't really matter.  The
real issue is that there may be critical data that has been written to
the journal that was not written to the filesystem before the journal
was aborted and the filesystem left in a read-only state.  This might,
for example, include a user's thesis or several years' worth of research.
(Why such work might not be backed up is a question I will leave for
another day, and falls into the "criminally negligent system
administrator" category....)

In general, you're better off running the journal after a journal
abort.  You may think you have experience to the contrary, but
are you sure?  Unless you snapshot the entire filesystem, and try it
both ways, you can't really know for sure.  There are classes of
errors where the filesystem has been completely trashed, and whether
or not you run the journal won't make a bit of difference.  

The much more important question is why the filesystem got trashed
in the first place.  Do you have marginal memory?  Hard drives?  Are
you running a beta-test kernel that might be buggy?  Fixing the
proximate cause is always the most important thing to do, since no
matter how clever the filesystem, if you have buggy hardware or
buggy device drivers, in the end you *will* be screwed.  A
filesystem can't compensate for those sorts of shortcomings.
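
If you want a starting point for that kind of hunt, a couple of
hedged examples (the device name is a placeholder): a non-destructive
read-only surface scan of a suspect disk, plus an offline memory test.

    # Read-only scan of the whole disk (or a single partition);
    # safe to run, but ideally on an idle or unmounted device.
    badblocks -sv /dev/hdc

    # Memory is best checked offline, e.g. by booting memtest86
    # and letting it run for several hours.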

> ..and, on a raid-1 disk set, a failure oughtta cut off the one bad 
> fs and not shoot down the entire raid set because that one fs fails.

I agree.  When is that not happening?

> ..sparse_super is IMNTHOAIME _not_ worth the saved disk space, 
> and should _not_ be the default setup option.

Interesting assertion.  I disagree; if you'd like to back up this
assertion with some arguments, I'll be happy to discuss it.  

I will note that a 128 meg filesystem still has half a dozen backup
superblocks, which should be more than enough to recover from a disk
error.  For truly large filesystems without sparse_super, the disk
space consumed by backup metadata grows as O(n**2), which means that
for a filesystem
which is 64GB and using 1k blocks (say because it is used for storing
Usenet articles), 50% of the space --- 32GB out of the 64GB --- will
be consumed by backup copies of the superblock and block group
descriptors.  There is a very good reason why sparse_super is turned
on by default.
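
As a practical aside, you can see where the backup superblocks live
on an existing filesystem, and point e2fsck at one of them if the
primary ever gets damaged.  A sketch (the device name is a
placeholder; the backup locations depend on the block size):

    dumpe2fs /dev/hda5 | grep -i superblock

    # 8193 is the usual first backup for 1k-block filesystems,
    # 32768 for 4k-block filesystems.
    e2fsck -b 8193 /dev/hda5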

> ..180 days is IMNTHOAIME _much_ too long between fsck's.  Reboots 
> defeat the point of /usr/bin/uptime and cause downtime, too.

This is configurable, and ultimately is up to each individual system
administrator.  Many people complain bitterly about the forced fsck
checks.  
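
For the record, both the time-based and the mount-count-based check
intervals are set with tune2fs; a quick sketch (the device name is a
placeholder):

    # Check every 30 days or every 20 mounts, whichever comes first
    tune2fs -i 30d -c 20 /dev/hda1

    # Or disable the periodic checks entirely
    tune2fs -i 0 -c 0 /dev/hda1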

I will note that much depends on your hardware.  If you have quality
hardware, and you're running a good, stable, well-tested production
kernel, in practice fsck should never turn up any errors, and how
often you run it is simply a function of how paranoid you're feeling.
You should not be depending on fsck to find any problems.  If you are,
then there's something desperately wrong with your system, and you
should find and fix the problems.

If you are using EVMS or some other system where you can take
read-only snapshots, something you *can* do is to periodically
(Tuesday morning at 3am, for example) have a cron script take a
read-only snapshot, run e2fsck on the snapshot, and then discard it.
If the e2fsck returns no errors, you can use tune2fs to set the
last-checked time on the mounted filesystem.
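
A rough sketch of such a cron job, using LVM-style snapshot commands
(the volume names, sizes and paths are placeholders, and the snapshot
syntax will differ under EVMS):

    #!/bin/sh
    # Check a read-only snapshot of /dev/vg0/home; if it is clean,
    # reset the "last checked" stamp on the origin volume.

    lvcreate -s -L 1G -n home-snap /dev/vg0/home || exit 1

    if e2fsck -f -n /dev/vg0/home-snap; then
        # Clean: update the last-checked time and reset the mount
        # count on the origin volume.
        tune2fs -C 0 -T now /dev/vg0/home
    else
        logger -p daemon.crit "e2fsck of home snapshot found errors"
    fi

    lvremove -f /dev/vg0/home-snap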

						- Ted


