Re: Distributed monitoring
On Sunday 29 March 2009 18:39:33 Thomas Goirand wrote:
> Jesús M. Navarro wrote:
> >> First of all, yes, we have implemented topology and dependencies, and
> >> we no longer receive UNKNOWN statuses. But this doesn't seem to be
> >> enough...
> > Then maybe I didn't understand properly your setup and/or needs. Why are
> > you being flooded and how would you know they are false positives? "Who"
> > is the "fooled" nagios server and why?
> Well, flooded might have been too strong a term. Let's say I am
> receiving about 10 alerts a day.
> The issue here is that we have our main nagios server in Florida, and it
> is supposed to monitor servers far away, over some unreliable links,
> like in Malaysia. Also, my upstream provider only has to "play" a bit
> with BGP (to optimize his traffic) and I get a dozen alerts... This
> is exactly the kind of situation that I want to avoid.
Been there, seen that, and you can avoid them up to a point. The problem is
that, due to instability in your network, you sometimes lose the connection
to the remote probes. Those notifications can be avoided, or at least
minimized, if you...
1) Have a means to independently check connectivity (e.g. pinging the router
at the remote end); then make your remote hosts/services depend on that
probe, so if the probe fails you are not swamped with false positives.
2) Increase the number of checks needed to declare the failure state "HARD"
(which is what triggers notifications).
As an example we have, among others, a little branch office with an
unreliable connection and no nagios server inside, so the five or six probes
we have there both depend on pinging the remote router (and that "ping" has
higher thresholds for raising WARNING and CRITICAL conditions than a ping on
a local network) and have max_check_attempts raised from "3" (our standard
for local probes) up to "10".
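A sketch of that setup in Nagios object syntax; the host and service names
are hypothetical, and the check_ping thresholds are illustrative rather than
our exact values:

```
# The "anchor" probe: ping the branch router, with relaxed thresholds
# and a high retry count before the state goes HARD
define service{
    use                     generic-service
    host_name               branch-router
    service_description     PING
    check_command           check_ping!600.0,30%!1000.0,60%
    max_check_attempts      10    ; raised from our local standard of 3
    }

# Every probe at the branch depends on that ping: when the ping is
# CRITICAL or UNREACHABLE, their notifications are suppressed
define servicedependency{
    host_name                       branch-router
    service_description             PING
    dependent_host_name             branch-host1
    dependent_service_description   HTTP
    notification_failure_criteria   c,u
    }
```

One servicedependency per dependent service (or a comma-separated list of
dependent hosts/services) keeps the central server quiet while the link
itself is the problem.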
This way we greatly limited false positives. Of course, this comes at a
price: while in "network outage" mode we won't be notified if there's a real
problem with one remote service (but how would our central server know, no
matter whether it runs Nagios or something else, if there's no available
path to send the info?). And even if the network is available, we will learn
about the problem in ten minutes instead of the standard three.
Since this is good enough for us, that's all we did for this branch office.
But there's a path to increased reliability if the need arises:
The first step would be to install a local satellite in the branch office;
this way we wouldn't need to relax those probes, since the local station is
close enough to the tested services to be more aggressive about reaching a
CRITICAL status (e.g. when pinging remotely we have to accept that losing
25% of ICMP packets is "good enough"; on a local LAN even losing 5% would
raise an alert).
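To make the contrast concrete, here are the two kinds of check_ping
invocation side by side, assuming the stock check_ping command that takes
"RTA,loss%" pairs for warning and critical; the exact numbers are
illustrative:

```
# From the central server, over the WAN: tolerate high latency and
# let up to ~25-30% packet loss pass as "good enough"
check_command   check_ping!600.0,30%!1000.0,60%

# From a local satellite on the branch LAN: far more aggressive,
# warning already at 5% loss
check_command   check_ping!100.0,5%!200.0,15%
```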
Of course, if our remote link is down, there's no way for the status changes
to reach the central server, so they would go unnoticed (we could raise an
alert about the link itself, but that would put us back at square one, since
we already stated the link to be unreliable). In those cases, we could add a
GSM modem to the remote Nagios station so that for critical services
(usually a small fraction of all those monitored at a branch) an alert is
sent no matter what (we retain central monitoring/alerting too, for double
coverage; I already said those are really critical services, so being
notified twice in those cases is acceptable for us).
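A minimal sketch of such an out-of-band notification command on the
satellite, assuming a gnokii-driven GSM modem (the command name and the
message format here are my own, not anything we actually run):

```
define command{
    command_name    notify-service-by-sms
    command_line    /usr/bin/printf "%b" "$NOTIFICATIONTYPE$: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$" | /usr/bin/gnokii --sendsms $CONTACTPAGER$
    }
```

You would then point the contacts for those few critical services at this
command (via their service_notification_commands) and put the mobile number
in each contact's pager field.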
> It is just that I would need 2 servers to monitor each other. Having
> only one server wouldn't be reliable.
There's documentation on high-availability deployment for Nagios (see
http://nagios.sourceforge.net/docs/3_0/redundancy.html) but in my opinion it
is too cumbersome and not really needed (for us, at least), since with a
central server and satellites it is the central one that "spies on" the
remotes, both by direct probing (pinging, and making sure the nagios process
is up using NRPE as a helper) and through passive services' "freshness".
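The "freshness" side can be sketched like this on the central server,
assuming the satellite forwards its results passively (e.g. via NSCA);
check_dummy is the standard plugin, the rest of the names are hypothetical:

```
define service{
    use                     generic-service
    host_name               branch-host1
    service_description     HTTP
    active_checks_enabled   0     ; results arrive passively from the satellite
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     900   ; complain if nothing heard for 15 minutes
    check_command           check_dummy!2!"No fresh result from the satellite"
    }
```

When the threshold expires, the central server runs the check_command itself,
which here simply forces a CRITICAL with an explanatory message.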
We have thought about doubling the central nagios server, but we found
nagios to be so reliable that we couldn't make a business case for it. It's
one of those things in the "nice to have" category that we'll do if
(miraculously) we have some spare time.
> > For a remote site, especially with unstable network connections, then
> > yes, you should deploy a nagios "satellite" at each location as per the
> > documentation you already saw. The only unneeded nagios portion at the
> > remote location would be the web interface.
If you "own" the branches, that's good enough. In our case, monitoring
remote networks on behalf of a third party, we found that giving read-only
access to the local web server is the easiest way to give a client access to
their own boxes (ACL management in Nagios is still very lacking).
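In Nagios 3.x this needs no extra ACL layer: with CGI authentication on, a
logged-in user only sees the hosts and services for which they are listed as
a contact. A sketch of the relevant cgi.cfg settings on the satellite:

```
# cgi.cfg on the branch satellite
use_authentication=1
# Leave the client's web login out of every authorized_for_* option;
# the CGIs will then show only hosts/services where that login appears
# as a contact, giving a de-facto read-only, per-client view
```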
> Ok, got it. That is good, as I didn't want to run a web server there.
> What's the minimum footprint, in terms of memory usage, for such a
> satellite? The smaller the better, of course, but I don't want it to swap.
I think that unless you go the embedded-hardware route you will find it
quite difficult to get reliable enough hardware without enough muscle;
although, of course, that will depend on the number of probes.
As an example, we have deployed some 250 probes for about 50 hosts on an old
PIII with 196MB RAM; it doesn't swap, the average load is about 0.25, and it
is "programmed" to cycle through all its services in 5 minutes.
There are some "hints" for large deployments too (basically: avoid Perl, or
at least use the embedded interpreter; have plenty of RAM; and tweak the
main engine a bit), but unless you are pressed for deployment time and will
have probes in the thousands, I'd avoid trying to "over-optimize" first and
would focus on "plain" configuration aspects (a deployment that avoids those
cumbersome false positives, intelligent use of templates so it's easier to
monitor new hosts/services, etc.), since those will surely be your time
sinks. It's true that the nagios config format is not the most elegant ever
designed (it tends toward redundancy and the need to "touch" things in some
three or four places to make it work), but in the end it does its job
nicely, and the new features in version 3.x are a real advance in this
regard.
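For reference, the engine tweaks mentioned above boil down to a couple of
nagios.cfg settings in 3.x (verify them against your build's documentation
before relying on them):

```
# nagios.cfg: hints for large installations
enable_embedded_perl=1           ; run Perl plugins inside ePN instead of
                                 ; forking a fresh perl for every check
use_large_installation_tweaks=1  ; skip some per-check environment setup
                                 ; at the cost of a few niceties
```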
> Thanks for all the above. It really helps and saves time to have your
> view on it. I'll ask the person in charge in our organization to do this
> (I try to avoid touching Nagios; I hate its configuration format, which
> often leads to errors).
Certainly it takes no small amount of time to become proficient with it,
but, as I already said, version 3.x is better, and once you get "the
feeling" of it everything tends to fit nicely.