
Re: Weakest point of a server?



Hi all,

On Thu, Feb 06, 2003 at 09:13:06PM +0800, Jason Lim wrote:
> Hi all,
> 
> I was wondering what kind of failures you experience with long-running
> hardware.

Mostly mechanical parts such as fans and hard disks.

CPUs can normally run for around 10 years without problems, as far as I know.

> Most of us run servers with very long uptimes (we've got a server here
> with uptime approaching 3 years, which is not long compared to some, but
> we think it is pretty good!).
>
> We're looking at "extending" the life of some of these servers, but are
> reluctant to replace all the hardware, especially since what is there
> "works"...
> 
> Most of these servers either have 3ware RAID cards, or have some other
> sort of RAID (scsi, ide, software, etc.). The hard disks are replaced as
> they fail, so by now some RAID 1 drives are actually 40Gb when only about
> 20Gb is used, because the RAID hardware cannot "extend" to use the extra
> size (but this is a different issue).

You can detect signs of an imminent failure with smartmontools.
These tools read the SMART values/logs that most modern hard disks provide.
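
For example (just a sketch; the device name /dev/hda is an assumption
for the first IDE disk, and the smartd.conf syntax may differ between
versions):

  # print the SMART health status, attributes and error log of a disk
  smartctl -a /dev/hda

  # or let the smartd daemon watch it and mail you on trouble,
  # e.g. with a line like this in /etc/smartd.conf:
  # /dev/hda -m root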

Often there are messages in /var/log/messages which indicate hard disk
problems.
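
Something like this can pull those out (only a rough sketch; the exact
wording of the messages depends on the driver):

  grep -i 'hd[a-z].*error' /var/log/messages
  grep -i 'I/O error' /var/log/messages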

> Now... we can replace all the fans in the systems (eg. CPU fan, case fans,
> etc.). Some even suggested we jimmy on an extra fan going sideways on the
> CPU heatsink, so if the top fan fails at least airflow is still being
> pushed around which is better than nothing (sort of like a redundant CPU
> fan system).

You can monitor CPU/case temperature with the sensors package,
as well as the mainboard voltages (power supply)
and the fan speeds (often fans get slower and slower before they fail).
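
A minimal sketch, assuming the lm-sensors tools are installed and
sensors-detect has already been run to find your sensor chips (the awk
check also assumes fan lines in the output contain "RPM"):

  sensors    # prints temperatures, voltages and fan speeds

  # crude cron check: mail root if any fan line reports 0 RPM
  out=$(sensors | awk '/RPM/ && $2+0 == 0')
  [ -n "$out" ] && echo "$out" | mail -s 'fan alert' root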

> But how about the motherboards themselves? Is it common for something on
> the motherboard to fail, after 3-4 years continuous operation without
> failure?
> 
> Or is there some other part(s) we should look out for instead... would the
> CPU itself die after 3 years continuous operation? Or maybe RAM? Or even
> the LAN cards?

RAM failures are also not that common.
NICs fail more often (voltage peaks or things like that?).
You can monitor them with mii-tool.
You can build failover with the bonding driver in the kernel, as far as
I know.
Not all cards/drivers supply correct MII information.
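
For example (a sketch only; the bond IP address and interface names are
made up, and the module parameters depend on your kernel version):

  # check the link status of the NICs
  mii-tool eth0 eth1

  # simple active-backup failover with the bonding driver
  modprobe bonding mode=1 miimon=100   # mode 1 = active-backup, link check every 100 ms
  ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
  ifenslave bond0 eth0 eth1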

> We keep the systems at between 18-22 degrees celsius (tending towards the
> lower end) as we've heard/read somewhere that for every degree drop in
> temperature, hardware lifetime is extended by X number of years. Not sure
> if that is still true?

I don't think modifying the cooling system of a server is a good thing,
because most systems are already optimized for good airflow.

> Any input/suggestions would be greatly appreciated.

It's always good to monitor your systems.
There are a lot more things you can monitor (UPS, network, ...).

For bigger installations you can use one or more centralized monitoring
servers. They can normally run all the previous checks and
notify you by mail, pager, SMS, ...
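
As a very small do-it-yourself sketch (the host names are made up), a
cron job on a central box could already do something like:

  #!/bin/sh
  # ping a list of servers and mail root about the ones that do not answer
  for host in www1 www2 mail1; do
      ping -c 3 -q "$host" >/dev/null 2>&1 || \
          echo "$host did not answer ping at $(date)" | \
          mail -s "host down: $host" root
  done

The dedicated monitoring servers below do the same kind of thing, but
with many more check types, escalations and a web frontend.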

A few monitoring servers:
Nagios (formerly NetSaint) (GPL)
Big Brother (commercial)
Big Sister (a GPL clone)

Markus
  _     ___
 #_~`--'__ `===-,  Markus Benning <me@w3r3wolf.de>
 `.`.     `#.,//   http://www.w3r3wolf.de
 ,_\_\     ## #\   
 `__.__    `####\  Open Source is a philosophy
      ~~\ ,###'~   not a price tag !
         \##'


