Re: Distributed monitoring

To: Thomas Goirand <thomas@goirand.fr>
Cc: debian-isp@lists.debian.org
Subject: Re: Distributed monitoring
From: Adam McGreggor <lists@amyl.org.uk>
Date: Sat, 28 Mar 2009 17:48:37 +0000
Message-id: <20090328174837.GJ4537@amyl.org.uk>
In-reply-to: <49CE59DA.5040505@goirand.fr>
References: <49CE59DA.5040505@goirand.fr>

On Sun, Mar 29, 2009 at 01:09:46AM +0800, Thomas Goirand wrote:
> The issue here is that we receive so many monitoring alerts that they
> become useless. It has already happened once that a server really had an
> issue and, because of the flood of alerts, we only realized it was
> down rather late (45 minutes to 1 hour of downtime, which is
> unacceptable when a quick reboot using our remote tools was enough
> to solve the issue...). Also, because of the number of alerts and the
> fact that they are unreliable (many false positives), we can't use our
> email-to-SMS gateway to send us alerts.

This sounds like a prima facie case for using matilda(1)[0].

> So what I wanted to have is something where multiple Nagios servers (or
> another product) would check if a given server is down, and if BOTH are
> reporting failure, then an alert is triggered. We could set up, let's
> say, 5 Nagios servers or so, distributed in different locations.

Pipe the mails to a script which uses diff(1) on each, excluding the
appropriate headers; if they all match, do nothing; if there are
differences, mail?
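
A minimal sketch of that comparison in Python rather than diff(1),
assuming each alert arrives as a raw RFC 822 message. The ignored
header list and the function names are hypothetical choices for
illustration, not anything Nagios or matilda provides, and multipart
mails are not handled:

```python
import difflib
from email import message_from_string

# Assumption: headers that legitimately differ between monitoring hosts.
IGNORED_HEADERS = {"received", "date", "message-id", "from", "return-path"}


def comparable_lines(raw_mail):
    """Return the lines of a mail that should match across monitoring hosts."""
    msg = message_from_string(raw_mail)
    headers = [
        f"{name}: {value}"
        for name, value in msg.items()
        if name.lower() not in IGNORED_HEADERS
    ]
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart mail: out of scope for this sketch, compare headers only.
        body = ""
    return headers + [""] + body.splitlines()


def alerts_differ(raw_mails):
    """True if any mail differs from the first once ignored headers are dropped."""
    first = comparable_lines(raw_mails[0])
    for other in raw_mails[1:]:
        diff = list(difflib.unified_diff(first, comparable_lines(other), lineterm=""))
        if diff:
            return True
    return False
```

The calling script would collect one mail per monitoring host for a
given check and only send mail onwards when alerts_differ() returns
True, as described above.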

[0] <http://ex-parrot.com/~chris/software.html#matilda> (but please note
<http://ex-parrot.com/~chris/wwwitter/20070305-chris_lightfoot_1978-2007.html>)

-- 
``Large increases in cost with questionable increases in performance
  can be tolerated only in racehorses and fancy women.'' (Lord Kelvin)
