Re: Weakest point of a server?

To: "Russell Coker" <russell@coker.com.au>, <debian-isp@lists.debian.org>
Subject: Re: Weakest point of a server?
From: "Jason Lim" <maillist@jasonlim.com>
Date: Fri, 7 Feb 2003 04:14:10 +0800
Message-id: <00ba01c2ce1c$4fe159d0$cb00a8c0@antivirus8>
Reply-to: "Jason Lim" <maillist@jasonlim.com>
References: <090601c2cde1$7d5826e0$cb00a8c0@antivirus8> <200302061511.58827.russell@coker.com.au>

> On Thu, 6 Feb 2003 14:13, Jason Lim wrote:
> > I was wondering what kind of failures you experience with long-running
> > hardware.
>
> I don't recall seeing a computer that had been in service for more than 3
> months fail in any way not associated with movement.  Moving parts (fans and
> hard drives) die.  Expansion boards and motherboards can die if they are
> moved or tweaked.

Well, these systems are treated like royalty (or, you could say, like babies).
They are almost never shut down or touched, apart from cleaning the air
filters. They are rackmount servers, so the filters are in the front and can
be removed, cleaned, or replaced without touching the system itself or moving
anything (except the filter).

>
> If you only clean air filters while leaving the machine in place, and if the
> fans are all solid with ball bearings then it should keep running for many
> years.

We know sleeve-bearing fans die pretty quickly and that ball-bearing fans
tend to keep running much better/longer, but do you know approximately how
long we're talking about? Is "long" 3 years, 5 years, or possibly even
longer than that?

I know we're talking about something that is pretty variable, but I
suppose the better question is:

How long should one go before replacing a ball-bearing fan?

From what others have said here, fans seem to be the weakest point in a
server (with hard disks second). Fans in some critical places are VERY
important (e.g. the CPU fan), so a failure there could cause even more
downtime than a failed hard disk: hard disks can have redundancy (RAID 1, 5,
etc.), but I've rarely heard people talk of "redundant fans". Even in
expensive DELL and HP servers there is usually only one fan on each CPU.
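
To put rough numbers on the RAID-versus-fan comparison above, here is a
minimal Python sketch. It assumes the parts fail independently and uses a
made-up annual failure probability (the 10% figure is purely illustrative,
not a measurement from these servers), and the function name is hypothetical.

```python
# Illustrative only: chance that ALL redundant parts fail in the same period.
# Assumes independent failures and ignores repair/replacement time.

def all_fail_probability(p_single: float, units: int) -> float:
    """Probability that every one of `units` independent parts fails,
    given each fails with probability `p_single` over the same period."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    return p_single ** units


if __name__ == "__main__":
    p_fan = 0.10  # assumed annual failure probability of one fan (made up)
    for n in (1, 2):
        p = all_fail_probability(p_fan, n)
        print(f"{n} fan(s): P(no working fan) = {p:.4f}")
```

In practice failures in one chassis are not fully independent (shared dust,
heat, and power), so this should be read as an optimistic bound rather than
a prediction.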

> > Most of us run servers with very long uptimes (we've got a server here
> > with uptime approaching 3 years, which is not long compared to some, but
> > we think it is pretty good!).
>
> I think that's a bad idea.  I've never seen a machine with an uptime of >1
> year boot correctly.  In my experience after more than a year of running
> someone will have changed something that makes either the OS or the important
> applications fail to start correctly and will have forgotten what they did
> (or left the company).

Everyone who has worked for us has stayed with us, fortunately :-)
Everyone also keeps a log of what is done to a system. The only thing I've
seen happen to such a system is that when it is booted, fsck automatically
checks the filesystems and usually comes up with hundreds of errors or so.
But usually after those are fixed it boots up okay.

> > Most of these servers either have 3ware RAID cards, or have some other
> > sort of RAID (scsi, ide, software, etc.). The hard disks are replaced as
> > they fail, so by now some RAID 1 drives are actually 40Gb when only about
> > 20Gb is used, because the RAID hardware cannot "extend" to use the extra
> > size (but this is a different issue).
>
> Software RAID can deal with this.

I will investigate this further.

> > Now... we can replace all the fans in the systems (eg. CPU fan, case fans,
> > etc.). Some even suggested we jimmy on an extra fan going sideways on the
> > CPU heatsink, so if the top fan fails at least airflow is still being
> > pushed around which is better than nothing (sort of like a redundant CPU
> > fan system).
>
> Not a good idea for a server system.  Servers are designed to have air flow in
> a particular path through the machine.  Change that in any way and you might
> get unexpected problems.

The non-brand-name rackmount servers usually aren't _that_ well designed.
Many of them can accommodate a variety of motherboard types, which means the
CPU could end up in any of several places inside the chassis, so the
designers can't assume it will be in one particular spot. I'm guessing that
as long as the CPU fan blows down and towards the rear of the server
(exhaust), it follows the general airflow of the system.

Even expensive systems from DELL and the like usually have the general
direction of the airflow going towards the back. We could simply "enhance"
this effect by putting a sideways fan on the CPU pointing backwards.

All of these systems are at least 3U, so heat is not as critical as it is
in 1U or similar systems.

> > But how about the motherboards themselves? Is it often for something on
> > the motherboard to fail, after 3-4 years continuous operation without
> > failure?
>
> I've only seen motherboards fail when having RAM, CPUs, or expansion cards
> upgraded or replaced.
>
> I've heard of CPU and RAM failing, but only in situations where I was not
> confident that they had not been messed with.

So I guess it is safe to say that, in general, CPU and RAM will not
fail provided they are not tampered with.

> > We keep the systems at between 18-22 degrees Celsius (tending towards the
> > lower end) as we've heard/read somewhere that for every degree drop in
> > temperature, hardware lifetime is extended by X number of years. Not sure
> > if that is still true?
>
> Also try to avoid changes in temperature.  Thermal expansion is a problem.
> Try to avoid having machines turned off for any period of time.  If working
> on a server with old hard drives power the drives up and keep them running
> unattached to the server while you are working for best reliability.  Turning
> an old hard drive off for 30 minutes is regarded as being a great risk.

Temperature is usually kept constant, though it varies with where the
server sits in the rack: the top tends to be 2-3 degrees warmer than the
bottom. On that note, what temperature difference do you observe between
top and bottom?
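
For the "lifetime per degree" question quoted earlier, one widely repeated
rule of thumb (derived loosely from the Arrhenius equation, and most often
cited for electrolytic capacitors) is that expected lifetime roughly doubles
for every 10 degrees Celsius drop. The Python sketch below only illustrates
that rule; whether it holds for any particular server component is an
assumption, and the function name and the 10-degree default are illustrative.

```python
# Rule-of-thumb sketch: lifetime doubles per `doubling_step_c` degrees cooler.
# This is a rough heuristic, not a measured property of any specific hardware.

def relative_lifetime(temp_c: float, reference_c: float,
                      doubling_step_c: float = 10.0) -> float:
    """Lifetime at `temp_c` relative to lifetime at `reference_c`
    under the 'doubles every N degrees cooler' heuristic."""
    if doubling_step_c <= 0:
        raise ValueError("doubling_step_c must be positive")
    return 2.0 ** ((reference_c - temp_c) / doubling_step_c)


if __name__ == "__main__":
    # Compare the two ends of the 18-22 degree range mentioned in the thread.
    print(f"18C vs 22C: {relative_lifetime(18.0, 22.0):.2f}x")
    # Top of rack assumed 3 degrees warmer than the bottom.
    print(f"top (+3C) vs bottom: {relative_lifetime(21.0, 18.0):.2f}x")
```

Under this heuristic a few degrees either way changes expected lifetime by
tens of percent rather than by years per degree, which is only as
trustworthy as the rule of thumb itself.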

>
> But the best thing to do is regularly replace hard drives.  Hard drives more
> than 3 years old should be thrown out.  The best thing to do is only buy
> reasonably large hard drives (say a minimum of 40G for IDE and 70G for SCSI).
> Whenever a hard drive seems small it's probably due to be replaced.

The thing that really bites is that "40GB" hard disks from different
manufacturers seem to have quite different formatted capacities... heck,
we've seen different capacities from the same manufacturer, on the same
model with slightly different model numbers!

I guess one way would be to pre-purchase a whole bunch of matching-size
drives, but then you run the risk of only using them a couple of years
later, and then they might not start up at that time :-/

Got any suggestions for getting around the above?
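
One workaround that is sometimes suggested for software RAID (it is not
something confirmed in this thread) is to make the RAID partition slightly
smaller than the smallest drive, so that a replacement "40GB" drive with a
marginally lower capacity still fits. A minimal Python sketch of that sizing
arithmetic follows; the function name, the 1% margin, and the 1 MiB alignment
are all illustrative assumptions.

```python
# Sketch: pick a partition size that should fit on any "same size" drive.
# Margin and alignment defaults are assumptions, not recommendations.

def common_partition_size(capacities_bytes, margin_percent=1,
                          align_bytes=1024 * 1024):
    """Return a partition size (in bytes) that is `margin_percent` smaller
    than the smallest listed drive, rounded down to `align_bytes`."""
    if not capacities_bytes:
        raise ValueError("need at least one drive capacity")
    if not 0 <= margin_percent < 100:
        raise ValueError("margin_percent must be in [0, 100)")
    smallest = min(capacities_bytes)
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    usable = smallest * (100 - margin_percent) // 100
    return usable - (usable % align_bytes)


if __name__ == "__main__":
    # Hypothetical formatted capacities of three nominally 40GB drives.
    drives = [40_000_000_000, 40_020_000_000, 41_100_000_000]
    print(common_partition_size(drives), "bytes per RAID partition")
```

The cost is a small amount of unused space on every drive, traded for not
having to match replacement drives exactly.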

Thanks for the info!
