Re: Distributed location server monitoring



Hello,

Yes, Nagios does distributed monitoring:

http://nagios.sourceforge.net/docs/2_0/distributed.html

However, the problem you're describing doesn't seem to be related to the 
number of Nagios servers that you're using, and adding more servers may 
only add unnecessary complexity.  Make sure that the upstream hops are 
defined as monitored hosts in Nagios *and* marked as parents of the 
servers that you're monitoring.  Then, if one of those upstream hops goes 
down, don't notify on it.  This of course assumes you're sure that the 
upstreams going down doesn't affect the connectivity of the server 
being monitored.  Alternatively, tweak the flap detection or volatility 
settings of the hops in between the monitor and the server being 
monitored.
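
As a rough sketch of the parent setup, and only one possible reading of 
the notification advice above: the host names, addresses, and the 
generic-host template below are made up for illustration, and the 
remaining required host directives are assumed to come from that 
template.  With a parent defined, Nagios flags web01 as UNREACHABLE 
rather than DOWN when the router is the thing that failed, and leaving 
"u" out of notification_options keeps that state from generating 
notifications.

```cfg
# Hypothetical upstream hop, monitored as a host in its own right.
define host{
        use                     generic-host    ; assumed template
        host_name               upstream-router
        alias                   Upstream router
        address                 192.0.2.1
        }

# Hypothetical server sitting behind that hop.
define host{
        use                     generic-host    ; assumed template
        host_name               web01
        alias                   Web server 1
        address                 192.0.2.10
        parents                 upstream-router ; hop between monitor and server
        notification_options    d,r             ; down and recovery only, no "u" (unreachable)
        }
```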

There is a reason why Nagios is reporting those hops as down, so 
you might want to look at why.  If Nagios sends a notification, that 
means the check has failed several times in a row over several 
minutes, which is fairly uncommon unless there really is a problem.  
It's not a 'false positive' from the Nagios server's view, so jump on 
the server and try to replicate the problem that Nagios is reporting.  
If you need to, tweak the number of failed checks before notification; 
again, getting the parent/child relationships of the monitored hosts 
configured will help.
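
For illustration only, the number of failed checks before notification 
is set per service (or per host) with max_check_attempts.  The host 
name, the generic-service template, and the values below are 
assumptions, not recommendations, and the directive names follow the 
Nagios 2.x documentation linked above (Nagios 3 prefers check_interval 
and retry_interval).  Assuming the default interval_length of 60 
seconds, these values would hold a notification back until the check 
has failed five times in a row, roughly four minutes after the first 
failure.

```cfg
define service{
        use                     generic-service ; assumed template
        host_name               web01           ; hypothetical host
        service_description     HTTP
        check_command           check_http
        max_check_attempts      5       ; consecutive failures before a hard state
        normal_check_interval   5       ; time units between regular checks
        retry_check_interval    1       ; time units between rechecks after a failure
        }
```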

Just on the basis of the limited information given in your e-mail, it 
sounds like you need to tune the Nagios configs to your environment to 
reduce the false positives rather than adding more monitoring servers.  
Once you have the configs reasonably well tuned, you can think about 
creating multiple monitoring points.

Steve Suehring
http://www.braingia.org

On Sat, May 10, 2008 at 03:33:09PM +0800, Thomas Goirand wrote:
> Hi,
> 
> We use Nagios internally to monitor about 50 servers. The biggest
> problem that we have is that it sends lots of false positives because it
> monitors the connections from one point to another more than the real
> services that have to be up. The rate of false positives is just too
> high, so it's kind of unusable. We ignore too many warnings, and I'm
> sure it will end up with something really down and we won't check for it.
> 
> Is there a distributed kind-of Nagios system that would use multiple
> nodes to check, and if (and ONLY if) all contactable monitoring servers
> report a problem, then we receive an alert?
> 
> Thomas
> 
> P.S.: We don't want to have multiple points where monitoring has to be
> set up; that would be a headache...
> 
> 
> -- 
> To UNSUBSCRIBE, email to debian-isp-REQUEST@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

