
Re: how to make Debian less fragile (long and philosophical)



Why should I bring my httpd down? Why should I bring my named down? Why
should I bring my database down? Why should I bring anything down?

What if I am a web retailer and this is my biggest day of the year, and 
every half hour of traffic counts for about 1% of my annual sales? 

I know of at least one company where this is true. 30% of their business
comes on one day of the year, and 50% of that is via the web, so in a 16
hour period they get 15% of their annual business through their web
server. You had better believe they want it to stay operational, even
through various kinds of failures. Having enough static tools lying
around to keep the thing limping for another fifteen minutes may buy
me the time to effect a switchover to some other machine, and to copy
the all-important last couple of hours' worth of orders to a tape.
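
To make the arithmetic explicit, here is a small Python sketch. The
function name and parameters are made up for illustration, and it assumes
traffic is spread evenly over the 16 hour window, which real traffic
certainly is not.

```python
def share_per_half_hour(peak_day_share, web_share, window_hours):
    """Average fraction of annual business per half hour of web uptime.

    Assumes (unrealistically) that orders arrive at a constant rate
    across the whole window.
    """
    web_peak_share = peak_day_share * web_share   # 0.30 * 0.50 = 0.15
    half_hours = window_hours * 2                 # 16 hours -> 32 slots
    return web_peak_share / half_hours


# Figures quoted above: 30% of business on one day, 50% of that via the web,
# concentrated in a 16 hour period.
print(f"{share_per_half_hour(0.30, 0.50, 16):.2%} of annual business per half hour")
```

On these figures an average half hour is worth a bit under half a percent
of the year, and the busiest half hours will be worth more than that.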

I don't think the Debian distribution has contemplated supporting this
kind of an environment. Not when the need for durable, reliable recovery
tools is so easily shrugged off with "that's what boot disks are for."

Note: Anything that was linked and loaded before the failure occurred will 
still be happily chugging along. Even though the library is now busted, 
my database is already linked and loaded and running.

Why should I shut it down? I want those services to stay operational, 
even though much of my system has been hosed. Even if there's a hardware
failure, I might want to keep everything operational long enough for 
a DNS change to propagate so that traffic has moved to another machine
offering the same service.

Note my other message where I described having a system's IDE cable 
fall off, disconnecting the root drive. I was able to keep the system
up and running for a full half hour before the eventual kernel 
panic. That allowed me to copy critical data off the system, and calmly
close down all the running services in a safe and secure way, rather
than having them go down in flames. 

I had lost / and /usr, but not /local or /u, which were on the other 
IDE cable, which was still connected. I was able to keep it going because
most programs were already linked and loaded, and critical applications 
like /bin/sh and /bin/cp had been recently used and so could still be 
loaded out of the disk cache. They were statically linked, so they 
worked. (Any attempt to "ls /" would block permanently though, 
locking that tty so that it was permanently unusable.)
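
For anyone who wants to check which of their own recovery tools would
survive a lost shared library, here is a minimal Python sketch. It is an
illustration only, not something used on the machine described above, and
the function name is hypothetical. It relies on one fact about the ELF
format: a dynamically linked executable carries a PT_INTERP program header
naming the runtime loader, and a statically linked one does not.

```python
import struct

PT_INTERP = 3  # ELF program header type that names the dynamic loader


def needs_dynamic_loader(image: bytes) -> bool:
    """Return True if the ELF image asks for a runtime loader (PT_INTERP)."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    is_64bit = image[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if image[5] == 1 else ">"    # EI_DATA: 1 = little-endian
    if is_64bit:
        (phoff,) = struct.unpack_from(endian + "Q", image, 32)
        phentsize, phnum = struct.unpack_from(endian + "HH", image, 54)
    else:
        (phoff,) = struct.unpack_from(endian + "I", image, 28)
        phentsize, phnum = struct.unpack_from(endian + "HH", image, 42)
    # p_type is the first 32-bit field of every program header entry.
    for i in range(phnum):
        (p_type,) = struct.unpack_from(endian + "I", image, phoff + i * phentsize)
        if p_type == PT_INTERP:
            return True
    return False
```

Reading a file with open(path, "rb").read() and passing the bytes to
needs_dynamic_loader() gives False for a statically linked /bin/sh and True
for a dynamically linked one. A False result only says the binary does not
need the loader. It says nothing about whether the binary is still in the
disk cache.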

The point is, even if I will eventually have to reboot due to 
some hardware fault, that doesn't necessarily mean I want to reboot NOW.
I might be a lot happier with a system that can limp along through the 
rest of the business day, or at least for another half hour or so, than 
with one that crashes and burns immediately upon any kind of failure, 
or which provides me with no way to shut things down cleanly because
it provides me with no shell and no usable executables.

Justin

On Tue, Aug 17, 1999 at 01:38:24PM -0400, Michael Stone wrote:
> On Tue, Aug 17, 1999 at 01:16:28PM -0400, you wrote:
> > I do often have backup roots on the system--so that I can reboot 
> > without being physically present at the machine. The minimal requirement
> > when a backup root is available is that:
> > 
> >   1. rebooting is OK 
> 
> You have yet to justify why that's the case. Repeating it still doesn't
> make it true.
> 
> Mike Stone


