
Re: Weakest point of a server?



On Thu, 6 Feb 2003 14:13, Jason Lim wrote:
> I was wondering what kind of failures you experience with long-running
> hardware.

I don't recall seeing a computer that had been in service for more than 3 
months fail in any way not associated with movement.  Moving parts (fans and 
hard drives) die.  Expansion boards and motherboards can die if they are 
moved or tweaked.

If you only clean the air filters while leaving the machine in place, and if 
the fans are all solid ball-bearing units, then it should keep running for 
many years.
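
For what it's worth, fan speeds can be watched in software.  Something like 
the following is a rough sketch (it assumes the lm-sensors package is 
installed and that the motherboard's sensor chip is supported; output varies 
by chip):

  apt-get install lm-sensors   # Debian package with the sensor tools
  sensors-detect               # interactive probe for the sensor chip
  sensors | grep -i fan        # fan RPM readings; a reading at or near 0
                               # usually means a dead or dying fan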

> Most of us run servers with very long uptimes (we've got a server here
> with uptime approaching 3 years, which is not long compared to some, but
> we think it is pretty good!).

I think that's a bad idea.  I've never seen a machine with an uptime of more 
than a year reboot correctly.  In my experience, after more than a year of 
running, someone will have changed something that makes either the OS or the 
important applications fail to start correctly, and they will have forgotten 
what they did (or left the company).
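
One way to reduce the surprises is to compare what is actually running 
against what is set to start at boot every now and then.  A rough sketch for 
a Debian box using sysvinit (the daemon names are just examples):

  ls /etc/rc2.d/S*                   # services linked to start in the
                                     # default runlevel
  ps ax | grep -E 'exim|apache|mysqld'
                                     # example daemons; anything running but
                                     # not linked in rc2.d was probably
                                     # started by hand and won't come back
                                     # after a reboot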

> Most of these servers either have 3ware RAID cards, or have some other
> sort of RAID (scsi, ide, software, etc.). The hard disks are replaced as
> they fail, so by now some RAID 1 drives are actually 40Gb when only about
> 20Gb is used, because the RAID hardware cannot "extend" to use the extra
> size (but this is a different issue).

Software RAID can deal with this.
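
With Linux software RAID, once both halves of the mirror have been swapped 
for larger disks and the new partitions created at full size, something like 
the following should extend the array and then the filesystem (a rough 
sketch with a recent mdadm; /dev/md0 and ext2/ext3 are just examples, and 
have backups handy):

  mdadm --grow /dev/md0 --size=max   # grow the md device to the size of the
                                     # smallest member
  resize2fs /dev/md0                 # then grow the ext2/ext3 filesystem to
                                     # fill the new space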

> Now... we can replace all the fans in the systems (eg. CPU fan, case fans,
> etc.). Some even suggested we jimmy on an extra fan going sideways on the
> CPU heatsink, so if the top fan fails at least airflow is still being
> pushed around which is better than nothing (sort of like a redundant CPU
> fan system).

Not a good idea for a server system.  Servers are designed to move air along 
a particular path through the machine.  Change that path and you may get dead 
spots or recirculated hot air, and unexpected problems along with them.

> But how about the motherboards themselves? Is it often for something on
> the motherboard to fail, after 3-4 years continuous operation without
> failure?

I've only seen motherboards fail when having RAM, CPUs, or expansion cards 
upgraded or replaced.

I've heard of CPUs and RAM failing, but only in situations where I couldn't 
be confident that they hadn't been handled or otherwise disturbed.

> We keep the systems at between 18-22 degrees Celsius (tending towards the
> lower end) as we've heard/read somewhere that for every degree drop in
> temperature, hardware lifetime is extended by X number of years. Not sure
> if that is still true?

Also try to avoid changes in temperature; thermal expansion is a problem.  
Try to avoid having machines turned off for any period of time.  If you are 
working on a server with old hard drives, power the drives up and keep them 
running, unattached to the server, while you work, for best reliability.  
Turning an old hard drive off for 30 minutes is a significant risk.

But the best thing to do is to replace hard drives regularly.  Drives more 
than 3 years old should be thrown out.  Only buy reasonably large drives (say 
a minimum of 40G for IDE and 70G for SCSI); whenever a drive starts to seem 
small it's probably due for replacement.
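
SMART data is handy for deciding which drives are due for replacement.  A 
rough sketch using smartmontools (/dev/hda is just an example, and not every 
drive reports every attribute):

  smartctl -i /dev/hda               # model and serial number, to match
                                     # against purchase records
  smartctl -A /dev/hda | grep -E 'Power_On_Hours|Temperature_Cel|Reallocated'
                                     # about 26000 power-on hours is roughly
                                     # 3 years of 24x7 operation; a climbing
                                     # reallocated sector count is a hint to
                                     # replace the drive early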

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page


