Re: Weakest point of a server?
On Thu, Feb 06, 2003 at 09:13:06PM +0800, Jason Lim wrote:
> Hi all,
> I was wondering what kind of failures you experience with long-running
Mostly mechanical parts like fans and hard disks.
CPUs can normally run for around 10 years without problems, as far as I know.
> Most of us run servers with very long uptimes (we've got a server here
> with uptime approaching 3 years, which is not long compared to some, but
> we think it is pretty good!).
> We're looking at "extending" the life of some of these servers, but are
> reluctant to replace all the hardware, especially since what is there
> Most of these servers either have 3ware RAID cards, or have some other
> sort of RAID (scsi, ide, software, etc.). The hard disks are replaced as
> they fail, so by now some RAID 1 drives are actually 40Gb when only about
> 20Gb is used, because the RAID hardware cannot "extend" to use the extra
> size (but this is a different issue).
You can detect signs of an impending failure with smartmontools.
These tools read the SMART values/logs that most modern hard disks provide.
Often there are also messages in /var/log/messages which indicate hard disk
problems.
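As a rough illustration of the SMART check, here is a minimal Python sketch that scans an attribute table of the kind "smartctl -A" prints and reports attributes whose normalized value has dropped to its threshold. The function name and the sample table are made up for illustration, and the column layout is an assumption based on typical smartctl output; real disks report different attributes.

```python
def failing_attributes(smartctl_output):
    """Return names of SMART attributes whose normalized value has reached its threshold."""
    failing = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns
        # (assumed layout: ID, NAME, FLAG, VALUE, WORST, THRESH, ...).
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # A threshold of 0 means the attribute can never trip.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


# Hypothetical sample data, not taken from a real disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 1412
194 Temperature_Celsius     0x0022   110   095   000    Old_age   Always       -       40
"""

if __name__ == "__main__":
    print(failing_attributes(SAMPLE))
```

In practice the text would come from running smartctl against the disk (for example from a cron job) instead of the sample string.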
> Now... we can replace all the fans in the systems (eg. CPU fan, case fans,
> etc.). Some even suggested we jimmy on an extra fan going sideways on the
> CPU heatsink, so if the top fan fails at least airflow is still being
> pushed around which is better than nothing (sort of like a redundant CPU
> fan system).
You can monitor CPU/case temperature with the sensors package (lm-sensors),
as well as the voltages of the mainboard (power supply)
and the speed of the fans (they often get slower and slower before failure).
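To make the "fans get slower before failure" idea concrete, here is a small Python sketch that pulls fan speeds out of sensors-style output and compares them with an earlier baseline. The line format ("fan1: 4500 RPM ..."), the 80% ratio and all names are assumptions for illustration; the labels and layout depend on the sensor chip and the sensors configuration.

```python
import re

# Assumed line format: "fan1:   4500 RPM  (min = 3000 RPM, div = 2)"
FAN_LINE = re.compile(r"^(fan\d+):\s+(\d+)\s+RPM", re.MULTILINE)


def fan_speeds(sensors_output):
    """Map fan label to current speed in RPM."""
    return {name: int(rpm) for name, rpm in FAN_LINE.findall(sensors_output)}


def slowing_fans(baseline, current, min_ratio=0.8):
    """Return fans now running below min_ratio of their baseline speed.

    A fan missing from the current reading counts as 0 RPM.
    """
    slow = []
    for name, base_rpm in baseline.items():
        rpm = current.get(name, 0)
        if base_rpm > 0 and rpm < base_rpm * min_ratio:
            slow.append(name)
    return sorted(slow)


# Hypothetical readings: one taken when the fans were new, one today.
BASELINE = """\
fan1:      4500 RPM  (min = 3000 RPM, div = 2)
fan2:      3200 RPM  (min = 2000 RPM, div = 2)
"""
TODAY = """\
fan1:      3400 RPM  (min = 3000 RPM, div = 2)
fan2:      3150 RPM  (min = 2000 RPM, div = 2)
"""

if __name__ == "__main__":
    print(slowing_fans(fan_speeds(BASELINE), fan_speeds(TODAY)))
```

Comparing against a stored baseline, rather than only the chip's fixed minimum, is what catches a fan that is slowly wearing out but has not stopped yet.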
> But how about the motherboards themselves? Is it often for something on
> the motherboard to fail, after 3-4 years continuous operation without
> Or is there some other part(s) we should look out for instead... would the
> CPU itself die after 3 years continuous operation? Or maybe RAM? Or even
> the LAN cards?
RAM failures are also not that common.
NICs fail more often (voltage peaks or something like that?).
You may be able to monitor them with mii-tool,
and you can build failover with the kernel's bonding driver, as far as I know.
Note that not all cards/drivers supply correct MII information.
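A link check along these lines could look like the Python sketch below, which reads mii-tool-style status lines and lists the interfaces that do not report "link ok". The exact output format and the function name are assumptions for illustration, and as noted above a driver that reports wrong MII data will make such a check unreliable.

```python
def links_down(mii_tool_output):
    """Return interface names whose status line does not contain 'link ok'."""
    down = []
    for line in mii_tool_output.splitlines():
        # Assumed line format: "<interface>: <status text>"
        iface, sep, status = line.partition(":")
        if not sep:
            continue  # not a status line
        if "link ok" not in status:
            down.append(iface.strip())
    return down


# Hypothetical sample output for two interfaces.
SAMPLE = """\
eth0: negotiated 100baseTx-FD, link ok
eth1: no link
"""

if __name__ == "__main__":
    print(links_down(SAMPLE))
```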
> We keep the systems at between 18-22 degrees Celsius (tending towards the
> lower end) as we've heard/read somewhere that for every degree drop in
> temperature, hardware lifetime is extended by X number of years. Not sure
> if that is still true?
I don't think modifying the cooling system of a server is a good idea,
because most systems are already optimized for good airflow.
> Any input/suggestions would be greatly appreciated.
It's always good to monitor your systems.
There are a lot more things you can monitor (UPS, network, ...).
For bigger installations you can use a centralized monitoring server.
These can normally run all the previous checks and
notify you by mail, pager, SMS, ...
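The general pattern (run a set of checks, collect the failures, send one notification) can be sketched in Python roughly as follows. This is not how any particular monitoring server works; the check names, host name and mail addresses are hypothetical.

```python
from email.message import EmailMessage


def run_checks(checks):
    """Run each check and return one 'name: detail' line per failure.

    checks maps a name to a callable returning (ok, detail).
    """
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A check that crashes is treated as a failure too.
            ok, detail = False, f"check crashed: {exc}"
        if not ok:
            failures.append(f"{name}: {detail}")
    return failures


def build_alert(host, failures, sender, recipient):
    """Build (but do not send) a mail summarizing the failed checks."""
    msg = EmailMessage()
    msg["Subject"] = f"[monitor] {host}: {len(failures)} check(s) failing"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(failures))
    return msg


if __name__ == "__main__":
    # Hypothetical checks with hard-coded results, for illustration only.
    checks = {
        "fan1": lambda: (False, "3400 RPM, below 80% of baseline"),
        "eth0": lambda: (True, "link ok"),
    }
    failures = run_checks(checks)
    if failures:
        print(build_alert("server1", failures,
                          "monitor@example.com", "admin@example.com"))
```

To actually deliver the message, something like smtplib.SMTP("localhost").send_message(msg) could be used, assuming a mail server is listening on the local machine.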
A few monitoring servers:
BigSister (a GPL clone of Big Brother)
#_~`--'__ `===-, Markus Benning <email@example.com>
`.`. `#.,// http://www.w3r3wolf.de
,_\_\ ## #\
`__.__ `####\ Open Source is a philosophy
~~\ ,###'~ not a price tag !