Re: Weakest point of a server?
- To: Jason Lim <maillist@jasonlim.com>
- Cc: debian-isp@lists.debian.org
- Subject: Re: Weakest point of a server?
- From: me@w3r3wolf.de
- Date: Thu, 6 Feb 2003 23:12:21 +0100
- Message-id: <20030206221221.GC5301@no>
- In-reply-to: <090601c2cde1$7d5826e0$cb00a8c0@antivirus8>
- References: <090601c2cde1$7d5826e0$cb00a8c0@antivirus8>
Hi all,
On Thu, Feb 06, 2003 at 09:13:06PM +0800, Jason Lim wrote:
> Hi all,
>
> I was wondering what kind of failures you experience with long-running
> hardware.
Mostly mechanical parts like fans and hard disks.
CPUs can normally run around 10 years without problems, as far as I know.
> Most of us run servers with very long uptimes (we've got a server here
> with uptime approaching 3 years, which is not long compared to some, but
> we think it is pretty good!).
>
> We're looking at "extending" the life of some of these servers, but are
> reluctant to replace all the hardware, especially since what is there
> "works"...
>
> Most of these servers either have 3ware RAID cards, or have some other
> sort of RAID (scsi, ide, software, etc.). The hard disks are replaced as
> they fail, so by now some RAID 1 drives are actually 40Gb when only about
> 20Gb is used, because the RAID hardware cannot "extend" to use the extra
> size (but this is a different issue).
You can detect early indications of failure with smartmontools.
These tools read the SMART values/logs that most modern hard disks
provide. Often there are also messages in /var/log/messages that
indicate hard disk problems.
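For example, the smartd daemon from the smartmontools package can poll
the disks in the background. A minimal /etc/smartd.conf sketch (the
device name and mail address are just examples, adjust for your system):

```
# /etc/smartd.conf -- sketch only
# -a              : monitor all SMART attributes
# -s S/../../7/02 : run a short self-test every Sunday at 02:00
# -m root         : mail warnings to root
/dev/hda -a -s S/../../7/02 -m root
```

A one-off health check is also possible with `smartctl -H /dev/hda`
(or `smartctl -a` for the full attribute table).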
> Now... we can replace all the fans in the systems (eg. CPU fan, case fans,
> etc.). Some even suggested we jimmy on an extra fan going sideways on the
> CPU heatsick, so if the top fan fails at least airflow is still being
> pushed around which is better than nothing (sort of like a redundant CPU
> fan system).
You can monitor CPU/case temperature with the sensors package,
as well as the mainboard voltages (power supply)
and the fan speeds (often fans get slower and slower before failure).
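As a sketch of how such a check could be automated from cron, say. The
sample line below is made up; in a real script it would come from the
`sensors` command of the lm-sensors package:

```shell
#!/bin/sh
# Hypothetical temperature check: warn when the CPU passes a limit.
# The sample line stands in for one line of real `sensors` output.
LINE="CPU Temp:  +52.0 C  (limit = +60 C)"
# Pull the numeric value out of the third field (+52.0 -> 52).
TEMP=$(echo "$LINE" | awk '{ print int($3) }')
if [ "$TEMP" -gt 60 ]; then
    STATUS="WARN: CPU at ${TEMP}C"
else
    STATUS="OK: CPU at ${TEMP}C"
fi
echo "$STATUS"
```

The same pattern works for fan-speed lines, so a slowly dying fan shows
up before the CPU overheats.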
> But how about the motherboards themselves? Is it often for something on
> the motherboard to fail, after 3-4 years continuous operation without
> failure?
>
> Or is there some other part(s) we should look out for instead... would the
> CPU itself die after 3 years continuous operation? Or maybe RAM? Or even
> the LAN cards?
RAM failures are also not so common.
NICs fail more often (voltage peaks or things like that?).
You can monitor their link status with mii-tool.
You can build failover with the bonding driver of the kernel, as far as
I know. Note that not all cards/drivers report correct MII information.
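On a 2.4 kernel the bonding setup could be sketched roughly like this
(interface names and addresses are examples; your distribution may wrap
this in its own network config, and it needs root):

```
# Load the bonding driver: mode=1 is active-backup (simple failover),
# miimon=100 checks the MII link state every 100 ms.
modprobe bonding mode=1 miimon=100
# Bring up the bond interface and enslave the two NICs:
ifconfig bond0 192.168.0.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1
```

Plain link status can be checked any time with `mii-tool eth0`.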
> We keep the systems at between 18-22 degrees celcius (tending towards the
> lower end) as we've heard/read somewhere that for every degree drop in
> temperature, hardware lifetime is extended by X number of years. Not sure
> if that is still true?
I don't think modifying the cooling system of a server is a good idea,
because most systems are already optimized for good airflow.
> Any input/suggestions would be greatly appreciated.
It's always good to monitor your systems.
There are a lot more things you can monitor (UPS, network ...).
For bigger installations you can use centralized monitoring
server(s). They can normally run all the previous checks and
notify you by mail, pager, SMS ...
A few monitoring servers:
Nagios (formerly NetSaint) (GPL)
BigBrother (commercial)
BigSister (a GPL clone)
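As an illustration of what one check looks like in Nagios (the host
name and thresholds are made up, and the snippet omits directives like
contact groups and check periods that a real setup needs):

```
# Ping check for one host: warn at 100ms RTT / 20% loss,
# critical at 500ms / 60%. Notify on warning, critical, recovery.
define service{
    host_name               www1
    service_description     PING
    check_command           check_ping!100.0,20%!500.0,60%
    max_check_attempts      3
    normal_check_interval   5
    retry_check_interval    1
    notification_options    w,c,r
}
```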
Markus
_ ___
#_~`--'__ `===-, Markus Benning <me@w3r3wolf.de>
`.`. `#.,// http://www.w3r3wolf.de
,_\_\ ## #\
`__.__ `####\ Open Source is a philosophy
~~\ ,###'~ not a price tag !
\##'