
Re: ..fixing ext3 fs going read-only, was : Sendmail or Qmail ? ..

On Thu, 11 Sep 2003 14:03:17 -0400, 
Theodore Ts'o <tytso@mit.edu> wrote in message 

> On Thu, Sep 11, 2003 at 02:04:19AM +0200, Arnt Karlsen wrote:
> > ..I still believe in raid-1, but, ext3fs???  
> > 
> > ..how does xfs, jfs and Reiserfs compare?  
> If you have random disk corruptions happening as often as you are, no
> filesystem is going to be able to help you.  The only question is how
> quickly the filesystem notices *before* user data starts getting
> irrecoverably lost.  Ext3 generally tends to be one of the more paranoid
> filesystems about checking assertions and "should never happen" cases,
> although I don't know how it compares to reiserfs, jfs, et al.  

..ok, how about ext3 versus ext2 on raid-1?

> > > Unless you're talking about *software* RAID-1 under Linux, and the
> > 
> > ..bingo, I should have said so.
> > 
> > > fact that you have to rebuild mirror after an unclean shutdown,
> > > but that's arguably a defect in the software RAID 1
> > > implementation.  On other systems, such as AIX's software RAID-1,
> > > the RAID-1 is implemented with a journal, 
> > 
> > ..but software RAID-1 under Linux is not or did I miss something
> > here?
> No, software RAID-1 does not do journalling at the RAID level.  That
> means that in the case of an unclean shutdown, the RAID system will
> need to re-establish the mirror.  

..and after a journal death, and fsck, the raid set will be able 
to re-establish itself, no?  Or does the journal do both/all disks 
in a raid set?

> As I said, this is a performance issue, since half the disk bandwidth
> of the RAID array will be diverted to re-establishing the mirror after
> the unclean shutdown. Note also this is true *regardless* of what
> filesystem you use, journaling or non-journaling.

..noted, non-issue in my case. 
> > ..ok, for my throttle boxes, here is where I should honk the 
> > horn and divert logging to a log server and schedule a fsck?
> > (And of course just reboot my mailservers on the same error.)
> For your throttle boxes, do you need to have any writes to your
> filesystems at all?  If what you care about is zero downtime, why not
> just run syslog over the network, and keep all of your filesystems
> mounted read-only?  Some extreme configurations I've seen (especially
> where ISPs don't have direct/easy access to their systems at remote
> POPs) use a read-only flash filesystem, a ramdisk for /tmp, and
> no spinning disks at all.  This significantly increases reliability,
> since the hard drive is often the most vulnerable part of the
> system, especially in the face of heat, vibration, etc.

..sounds like an idea.  The major point against is geography, 
I like to arrive at stand-alone one-box solutions, but networked 
logging is a good way to verify the network status.  What is 
used, ssh tunnels?
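
A minimal sketch of the "syslog over the network" idea, using Python's
standard logging module.  The function name, the logger name and the
host in the usage comment are made up for illustration; the transport
here is plain UDP syslog with no tunnelling or encryption, which may or
may not suit a given network.

```python
import logging
import logging.handlers


def make_remote_logger(host, port=514, name="throttle-box"):
    """Return a logger that ships records to a remote syslog daemon.

    host/port identify the log server (hypothetical values); nothing
    is written to the local disk by this handler.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler sends over UDP by default when given a (host, port) pair.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Example use (hypothetical host name):
#   log = make_remote_logger("loghost.example.net")
#   log.warning("filesystem remounted read-only")
```

Since UDP is fire-and-forget, a silent log server also says something
about the network path, which fits the "verify the network status"
point above.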

> > ..IMHO the debian bootstrap should first read the rpm database 
> > and generate a deb database, and then do 'apt-get update && \
> > apt-get dist-upgrade'.  _Is_ there such a bootstrap beast?
> While this would be interesting for those people who are converting
> from Red Hat to Debian, it's a lot more complicated than that, since
> you also have to convert over the configuration files; Red Hat and
> Debian don't necessarily store files in the same location.

..I know.  ;-)

> I generally find that for production systems, it's much safer and
> simpler to install Debian on a new disk (and on a new system), and
> then copy the configuration files over.  That way, you can
> test the system and make sure everything is A-OK before cutting over
> something on a production system.
..yeah, my pipe dream.  ;-)

> (By the way, it seems like 50% of your problem is that you're doing
> things on the cheap, and yet you still want 100% reliability.  If you
> want "carrier-grade reliability", you need to pay a little bit extra,
> and do things like have hot spares, and installation scripts that
> allow you to create and configure new servers automatically, without
> needing manual handwork.)

..hey, the isp shop is not mine, and it _is_ a small operation, 
so I need to grow it so I can charge'em.  ;-)  These guys are 
Wintendo convertites, and I do the hard stuff for 'em.  ;-)
> > ..256MB, but the disks may be marginal, on the known bad disks I get
> > write errors.  I have seen this same error on power "blinks",
> > failures lasting for about 1/3 of a second without losing monitor
> > sync etc on my desktops, once frying a power supply, but usually
> > these "blinks" cause no harm.
> Sounds like you have marginal power.  Do you have a UPS (preferably a
> continuous UPS) to protect your systems?  If not, why not?  (Again,
> it's a bad idea to expect "carrier-grade reliability" when you're not
> willing to pay for the basic high-quality equipment, backup equipment,
> and devices such as UPSes to protect your equipment.)

..2 different sites, I have marginal power in my lab, but the 
isp gear is on ups, and that again is on a priority grid feed.

..will be producing my own power on this; geek code suggestions?:
http://crest.org/discussion/gasification/199903/msg00055.html  ;-)

> > ..ah.  So with a 30GB /var ext3fs raid-1 I would have 25% or 13%
> > consumed by backup copies of the superblock and block group
> > descriptors?
> It's an order n**2 problem; so it's not a linear relationship.  And
> most people get annoyed by that kind of overhead, long before it gets
> to 10% or above.  

..so I'm tolerant.  ;-)
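
A rough sketch of why the overhead grows as n**2.  It assumes the old
layout where *every* block group carries a copy of the superblock and
of the whole group descriptor table (no sparse_super), 32-byte group
descriptors, and 8 * block_size blocks per group.  The function name
is made up, and the figures it produces follow from these assumptions
only, not from anything measured in this thread.

```python
import math

GROUP_DESC_BYTES = 32  # assumed size of one ext2/ext3 group descriptor


def backup_overhead_fraction(fs_bytes, block_size):
    """Fraction of the filesystem used by superblock and group
    descriptor copies when every block group holds a copy."""
    blocks_per_group = 8 * block_size   # one bitmap block maps this many blocks
    total_blocks = fs_bytes // block_size
    groups = math.ceil(total_blocks / blocks_per_group)
    # The descriptor table grows with the number of groups ...
    gdt_blocks = math.ceil(groups * GROUP_DESC_BYTES / block_size)
    # ... and each group stores one superblock plus the whole table.
    per_group = 1 + gdt_blocks
    return groups * per_group / total_blocks


GIB = 1 << 30
for size_gib in (30, 60, 120):
    frac = backup_overhead_fraction(size_gib * GIB, 1024)
    print(f"{size_gib} GiB, 1K blocks: {frac:.2%}")
```

Under these assumptions the number of groups and the size of each copy
both scale with the filesystem size, so doubling the size roughly
doubles the *fraction* lost, and quadruples the absolute amount.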

> > ..how does the journalling system choose which blocks to work from?
> > What I've been able to see, the journal dies when their super blocks
> > go bad?
> The filesystem needs the superblock in order to find the journal.  If
> you have a single gigantic filesystem mounted on /, then if the
> primary superblock is corrupted, the kernel will not be able to mount
> /, and you're hosed.  E2fsck will automatically try the primary
> superblock, and if that is corrupt, it will try the first backup
> superblock.  If that is corrupted as well, a human will need to
> manually try one of the other backup superblocks.

..this can be tuned to try more blocks before whining for manpower?
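
For the manual case, a small sketch of where the backup superblocks
would normally sit.  It assumes the sparse_super layout (backups in
group 1 and in groups that are powers of 3, 5 and 7) and 8 * block_size
blocks per group; the function names are made up, and the real
locations for a given filesystem are better taken from what mke2fs
reported (or from "mke2fs -n", which only prints) than from this
arithmetic.

```python
def backup_superblock_groups(group_count, sparse_super=True):
    """Block groups (other than group 0) holding a superblock backup."""
    if not sparse_super:
        return list(range(1, group_count))
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < group_count:      # powers of 3, 5 and 7
            groups.add(g)
            g *= base
    return sorted(groups)


def backup_superblock_blocks(group_count, block_size=4096):
    """Block numbers of the backups, usable with 'e2fsck -b <block>'."""
    blocks_per_group = 8 * block_size
    # With 1K blocks the first data block is 1, so everything shifts by one.
    first_data_block = 1 if block_size == 1024 else 0
    return [g * blocks_per_group + first_data_block
            for g in backup_superblock_groups(group_count)]


# A 30 GiB filesystem with 4K blocks has 240 block groups.
print(backup_superblock_blocks(240))
```

Any block number from that list can be handed to "e2fsck -b" when both
the primary and the first backup are bad.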

> If your primary superblock is getting corrupted often, then first of
> all, you should try to figure out why this is happening, and take
> affirmative action to prevent it.  (The fact that you're reporting
> marginal power is supremely suspicious; marginal power can cause disk
> corruptions very easily.  Getting higher quality power supplies will
> help, but a UPS is the first thing I would get.)

..yeah, I'm working on the power bit.  ;-)

> Secondly, you're better off using a small root filesystem that
> generally isn't modified often.  What I normally do is use a 128 meg
> root filesystem, with a separate /var partition (or /var symlinked to
> /usr/var), and /tmp as a ram disk.  With the root filesystem rarely
> changing, it's much less likely that it will be corrupted due to
> hardware problems.  Then the root filesystem can come up, and e2fsck
> can repair the other filesystems.

..yeah, except for /tmp on ramdisk, that's how I do my boxes, 
and my isp business client is learning his lesson good.  ;-)

> But I repeat, your filesystems shouldn't be getting corrupted in the
> first place.  Using a separate root filesystem is a good idea, and
> will help you recover from hardware problems, but your primary
> priority should be to avoid the hardware problems in the first place.
> 						- Ted

..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
  Scenarios always come in sets of three: 
  best case, worst case, and just in case.
