Re: Distributed monitoring

On Sun, Mar 29, 2009 at 01:09:46AM +0800, Thomas Goirand wrote:
> The issue here is that we receive so many monitoring alerts that it
> becomes useless. It happened once already that a server really had an
> issue, and because of the flood of alerts, we only realized it was
> down a bit late (45 minutes to 1 hour of down time, which is already
> unacceptable when only a quick reboot using our remote tools was enough
> to solve the issue...). Also, because of the number of alerts and the
> fact they are unreliable (many false positives), we can't use our email
> to SMS gateway to send us alerts.

This sounds like a prima facie case for using matilda(1)[0].

> So what I wanted to have is something where multiple nagios servers (or
> another product) would check if a given server is down, and if BOTH are
> reporting failure, then an alert is triggered. We could set up to let's
> say 5 nagios servers or something, distributed in different locations.

Pipe the mails to a script which uses diff(1) on each, excluding the
appropriate headers; if they all match, do nothing; if there are
differences, mail?
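
For what it's worth, a minimal sketch of that comparison in Python. Python's
difflib stands in for diff(1) here, the function names are made up for
illustration, and the list of headers to exclude is only a guess at what
differs between monitors. It assumes plain-text (non-multipart) alert mails
and at least one mail in the batch. It only reports whether the mails
differ; whether and where to send mail is left to the caller.

```python
import difflib
from email import message_from_string

# Headers assumed to vary between otherwise identical alerts.
# This list is an assumption; adjust it for your own monitors.
VOLATILE_HEADERS = {"received", "date", "message-id", "return-path", "from"}


def comparable_lines(raw_mail):
    """Return the mail as a list of lines with volatile headers dropped."""
    msg = message_from_string(raw_mail)
    if msg.is_multipart():
        raise ValueError("this sketch only handles plain-text alerts")
    headers = [
        f"{name}: {value}"
        for name, value in msg.items()
        if name.lower() not in VOLATILE_HEADERS
    ]
    return headers + [""] + msg.get_payload().splitlines()


def mails_differ(raw_mails):
    """True if any mail differs from the first, ignoring volatile headers."""
    first, *rest = [comparable_lines(m) for m in raw_mails]
    # unified_diff yields nothing when the two line lists are identical.
    return any(
        list(difflib.unified_diff(first, other, lineterm=""))
        for other in rest
    )
```

Collect one alert per monitoring host for the same check, pass the raw
messages to mails_differ(), and mail only when it returns True.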

[0] <http://ex-parrot.com/~chris/software.html#matilda> (but please note

``Large increases in cost with questionable increases in performance
  can be tolerated only in racehorses and fancy women.'' (Lord Kelvin)

