
Re: Uptime (Was: Re: [OT, FLAME] Linux Sucks)



On Mon, 07 Apr 2003, nate wrote:

> Jamie Lawrence said:
> 
> > I hate to join in the flamefest, but if that's the case, you need
> > to find a different admin for those servers.
> >
> > If a professional admin can't ensure the server they run is going to boot,
> > they need some remedial help, some process control, or a good beating.
> > Sure, mistakes happen, but the methodology of change control for *nix
> > machines isn't that hard.
> 
> 
> (pouring gas on the fire)
> 
> I guess you haven't run any machines that have extremely long uptimes?

At my last full-time admin job, I moved a cluster of ~30 machines, 
mostly Sparcs, about 75% of which had been up for over a year.
When I left that job, most of them had uptimes of over a year.
(It has been a while now, but I think every box that didn't had either a
hardware failure or an OS upgrade as part of being repurposed.)

> the usual problems are just some software doesn't load on boot, it's

That's what I was referring to. Forgetting to write an init script
is a sign of either carelessness or a newbie. These things should be
tested when the software is installed in the first place. Things that
need a reboot to test (network config, kernel upgrades (not counting
high-end Solaris tricks here), etc.), well, give you an opportunity to
test.
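
To make the install-time check concrete, here is a minimal sketch of the
kind of pre-reboot sanity test I mean. It assumes SysV-style init scripts
in a directory like /etc/init.d; the function name, the service names, and
the default path are hypothetical, so adjust for your own layout.

```python
import os

def missing_init_scripts(expected, init_dir="/etc/init.d"):
    """Return the names in `expected` that have no executable script in init_dir.

    `expected` is a hypothetical list of services you want started at boot.
    """
    missing = []
    for name in expected:
        path = os.path.join(init_dir, name)
        # To be run at boot, the script must exist as a regular file
        # and be executable.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return sorted(missing)

if __name__ == "__main__":
    # Example service names only; substitute whatever the box must run.
    for name in missing_init_scripts(["ssh", "cron"]):
        print("no usable init script for:", name)
```

This only proves that a script is present and executable, not that it
works, so it complements an actual test start of the service rather than
replacing one.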

> more rare in my experience to have hardware fail in the space of
> restarting a server.

Hardware failure, of course, is not an admin's fault. In an HA
environment, not ordering spares is. (If you don't have uptime
commitments or budget, then of course this doesn't apply.)

> [...] To some extent
> I feel the same way about the routers that I ran, always was sure
> to write the config to flash before rebooting them just incase.

I have always been fully confident in my routers. They pull OS 
and config from change-controlled servers on boot. Never had a boot 
problem, aside from testing new OSes, stupid router tricks, etc.
(The main thing to watch there is power failures... this isn't the time
to go into network architecture, though.)

> which is one reason I got upset at freebsd this past week, upgraded
> from 4.7 to 4.8 and was shocked that I had to upgrade the kernel
> AND reboot the box before ps would work again. By contrast most of

Used to run nothing but Free- and Open- BSD. Still love the OS, but my
machines are in California, and I no longer am, and nobody at the colo
where they live understands anything besides Linux. On the bright side,
I found Debian, and have seen the light... (Debian Saves? WWDD?)
 
> and I'm not alone I'm sure. I know tons of system/network admins
> and have never known one who ran reliable servers to not have
> a case when they didn't make an init script for something they wanted
> to load on boot before they restarted a machine.

Sure. I've made a ton of mistakes. Gotta learn somehow... and I'm a lot
more fast and loose on my personal machines. I'm just saying there is no
excuse for sloppy change control in a production environment.

Anyway, sorry to perpetuate an annoying thread. I was grumpy and should
have known better.

-j



-- 
Jamie Lawrence                                        jal@jal.org
"I'm a bastard. I have absolutely no clue why people can ever 
think otherwise."
   - Linus Torvalds


