Re: Distributed monitoring

To: debian-isp@lists.debian.org
Subject: Re: Distributed monitoring
From: "Jesús M. Navarro" <jesus.navarro@undominio.net>
Date: Sun, 29 Mar 2009 20:12:52 +0200
Message-id: <200903292012.52376.jesus.navarro@undominio.net>
In-reply-to: <49CFA445.5070102@goirand.fr>
References: <49CE59DA.5040505@goirand.fr> <200903291441.00906.jesus.navarro@undominio.net> <49CFA445.5070102@goirand.fr>

Hi:

On Sunday 29 March 2009 18:39:33 Thomas Goirand wrote:
> Jesús M. Navarro wrote:
> >> First of all, yes, we have implemented topology and dependencies, and
> >> we already do not receive UNKNOWN status. But this doesn't seem to be
> >> enough...
> >
> > Then maybe I didn't properly understand your setup and/or needs.  Why are
> > you being flooded and how would you know they are false positives?  "Who"
> > is the "fooled" nagios server and why?
>
> Well, flooded might have been too strong a term. Let's say I am
> receiving about 10 alerts a day.
>
> The issue here is that we have our main nagios server in Florida, and it
> is supposed to monitor servers far away, with some unreliable links,
> like in Malaysia. Also, my upstream provider only needs to "play" a bit
> with BGP (to optimize his traffic), and I get a dozen alerts... This
> is exactly the kind of situation that I want to avoid.

Been there, seen that, and you can avoid them up to a point.  The problem is
that network instability sometimes makes you lose the connection to the
remote probes.  Those notifications can be avoided, or at least minimized,
if you:
1) Have a means to check connectivity independently (e.g. pinging the router
at the remote end), and make your remote hosts/services depend on that
probe, so that when the probe fails you are not swamped with false positives.
2) Increase the number of failed checks required to declare the failure
state "HARD" (which is what triggers notification).

As an example we have, among others, a small branch office with an
unreliable connection and no nagios server on site.  The five or six probes
we have there all depend on pinging the remote router (and this "ping" has
higher thresholds for raising WARNING and CRITICAL conditions than a ping on
a local network), and they have max_check_attempts raised from "3" (our
standard for local probes) to "10".
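
To make the effect of both measures concrete, here is a rough Python sketch.
It is only a toy model of the idea, not Nagios code and not our config: the
Probe class, the parent_ok flag and the sample check results are all
invented for illustration.  A failure only becomes "HARD" after
max_check_attempts consecutive failed checks, and nothing is sent while the
probe it depends on is itself failing.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical model of one monitored service (not real Nagios code)."""
    max_check_attempts: int
    failures: int = 0        # consecutive failed checks so far
    notified: bool = False   # True once the current HARD failure was reported

    def record(self, ok: bool, parent_ok: bool = True) -> bool:
        """Feed one check result; return True if a notification is sent now."""
        if ok:
            # Recovery resets the counter and re-arms notification.
            self.failures = 0
            self.notified = False
            return False
        self.failures += 1
        hard = self.failures >= self.max_check_attempts
        # Notify once per HARD failure, and only if the probe we depend on
        # (e.g. the ping to the remote router) is healthy.
        if hard and parent_ok and not self.notified:
            self.notified = True
            return True
        return False


def notifications(probe: Probe, results, parent_ok: bool = True) -> int:
    """Count the notifications produced by a sequence of check results."""
    return sum(probe.record(ok, parent_ok) for ok in results)


# Invented sample: a short outage of five consecutive failed checks.
blip = [True] + [False] * 5 + [True]

local = notifications(Probe(max_check_attempts=3), blip)
remote = notifications(Probe(max_check_attempts=10), blip)
masked = notifications(Probe(max_check_attempts=3), blip, parent_ok=False)
print(local, remote, masked)
```

With three attempts the short outage produces a notification; with ten
attempts, or with the router ping failing, it produces none.  Real Nagios
has more moving parts (retry intervals, re-notification and so on), so take
this only as a picture of the principle.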

This way we greatly reduced false positives.  Of course, this comes at a
price: while in "network outage" mode we won't be notified if there's a real
problem on a remote service (but how would our central server know, whether
Nagios or something else, if there's no available path to send the info?).
And even when the network is available we will learn about the problem in
ten minutes instead of the standard three.

Since this is good enough for us, that's all we did at this branch office.
But there's a path to increased reliability if the need arises:

The first step would be to install a local satellite in the branch office;
this way we wouldn't need to relax those probes, since the local station is
close enough to the tested services to be more aggressive about reaching a
CRITICAL status (e.g. when pinging remotely we need to accept that losing
25% of ICMP packets is "good enough"; on a local LAN even losing 5% would
raise an alert).
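
As an illustration of the different strictness, a small Python sketch
follows.  The ping_state function and both threshold sets are made up for
this message (only the 25% and 5% figures come from our setup); they are not
our actual check_ping arguments.

```python
def ping_state(loss_pct: float, warn_pct: float, crit_pct: float) -> str:
    """Map a measured packet-loss percentage to a state name."""
    if loss_pct >= crit_pct:
        return "CRITICAL"
    if loss_pct >= warn_pct:
        return "WARNING"
    return "OK"


# Hypothetical thresholds: relaxed over the unreliable WAN link,
# strict when the satellite sits on the same LAN as the services.
REMOTE = {"warn_pct": 40.0, "crit_pct": 60.0}
LOCAL = {"warn_pct": 5.0, "crit_pct": 20.0}

print(ping_state(25.0, **REMOTE))  # tolerated from the central server
print(ping_state(25.0, **LOCAL))   # far too much loss on a LAN
print(ping_state(5.0, **LOCAL))    # already worth an alert locally
```

The same measurement is judged differently depending on where the probe
runs, which is the whole point of moving the checks next to the services.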

Of course, if our remote link is down, there's no way for the status changes
to reach the central server, so they would go unnoticed (we could raise an
alert on the link itself, but that would put us back at square one, since we
already stated that the link is unreliable).  In those cases we could add a
GSM modem to the remote Nagios station, so that for critical services
(usually a small fraction of all those monitored at a branch) an alert will
be sent no matter what (we would keep central monitoring/alerting too for
double coverage; as I said, those are really critical services, so being
notified twice in those cases is acceptable for us).

>
> It is just that I would need 2 servers to monitor each others. Having
> only one server wouldn't be reliable.

There's documentation on high availability deployment for Nagios (see
http://nagios.sourceforge.net/docs/3_0/redundancy.html) but in my opinion it
is too cumbersome and not really needed (for us, at least): with a central
server and satellites, it is the central server that "spies" on the remotes,
both by direct probing (ping, and making sure that the nagios process is up
using NRPE as a helper) and through the "freshness" of passive services.
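
The "freshness" idea can be pictured with a few lines of Python.  This is a
sketch under my own assumptions, not how Nagios implements it: the is_stale
function, the 15-minute threshold and the timestamps are invented.  The
central server remembers when each passive result last arrived from a
satellite and treats silence beyond the threshold as a problem in itself.

```python
from datetime import datetime, timedelta


def is_stale(last_result: datetime, now: datetime, threshold: timedelta) -> bool:
    """True when the newest passive result is older than the threshold."""
    return now - last_result > threshold


THRESHOLD = timedelta(minutes=15)      # hypothetical freshness threshold
now = datetime(2009, 3, 29, 20, 0)     # fixed time so the example is repeatable

# The satellite reported 10 minutes ago: still considered fresh.
recent = is_stale(datetime(2009, 3, 29, 19, 50), now, THRESHOLD)
# The satellite has been silent for 30 minutes: stale, so worth alerting on.
silent = is_stale(datetime(2009, 3, 29, 19, 30), now, THRESHOLD)
print(recent, silent)
```

Combined with the direct ping and the NRPE process check, a stale result
tells the central server that the satellite (or the path to it) is in
trouble even though nobody explicitly reported a failure.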

We have thought about doubling the central nagios server but we found nagios
to be very reliable, so we didn't manage to make a business case for it.
It's one of those things in the "nice to have" category that we'll do if
(miraculously) we have some spare time.

> >
> > For a remote site, especially with unstable network connections, then yes,
> > you should deploy a nagios "satellite" on each location as per the
> > documentation you already saw.  The only unneeded nagios portion on the
> > remote location would be the web interface.

If you "own" the branches that's good enough.  In our case, monitoring remote
networks on behalf of a third party, we found that giving read-only access to
the local web server is the easiest way to give a client access to their own
boxes (ACL management in Nagios is still very lacking).

>
> Ok, got it. That is good, as I didn't want to run a web server there.
> What's the minimum footprint, in terms of memory usage, for such a
> satellite? The smaller the better, of course, but I don't want it to swap
> either...

I think that unless you go the embedded hardware route you will find it
quite difficult to find reasonably reliable hardware that lacks enough
muscle, though of course that will depend on the number of probes.

As an example, we have deployed some 250 probes for about 50 hosts on an old
PIII with 196MB RAM; it doesn't swap, average load is about 0.25, and it
is "programmed" to cycle through all services in 5 minutes.

There are some "hints" for large deployments too (basically, avoid Perl or at
least use the embedded interpreter, have plenty of RAM and tweak the main
engine a bit), but unless you are pressed for deployment time and will have
probes in the thousands, I'd avoid trying to "overoptimize" first and I'd
focus on "plain" configuration aspects (a deployment that avoids those
cumbersome false positives, intelligent use of templates so it's easier to
monitor new hosts/services, etc.) since they'll surely be your time sinks.
It's true that the nagios config is not the most elegant ever designed: it
tends toward redundancy and needs you to "touch" things in three or four
places to make it work.  But in the end it does its job nicely, and the new
features in version 3.x are a real advance in this regard.

> Thanks for all the above. This really helps and saves time to have your
> view on it. I'll ask the person in charge in our organization to do this
> (I try to avoid touching Nagios, I hate its configuration format that
> often leads to errors).

Surely it takes a fair amount of time to become proficient with it, but as
I already said, version 3.x is better, and once you get "the feel" of it
everything tends to fit together nicely.
-- 
SALUD,
Jesús
***
jesus.navarro@undominio.net
***
