
Re: alioth is down (again)



On Mon, Jan 30, 2012 at 3:40 AM, Poison Bit wrote:

>   approach one)  Run a public nagios, monit, or whatever, configured
> with templates to notify this list on defined events (e.g. more
> than 10 minutes down? Is it the service, the DNS, the whole machine,
> or the whole network? Has the service recovered?).

I don't think it would be appropriate to notify d-d-a or d-i-a on
every service flap. Servers are already monitored:

http://dsa.debian.org/
https://nagios.debian.org/nagios3/
http://munin.debian.org/

>   approach two)  Search the available open source monitoring
> systems for one that can run a "status.debian.org", so that instead
> of relying on emails, users having an issue can look up such a
> dashboard and see present and past status or issues.

Such dashboards already exist:

http://dsa.debian.org/
https://nagios.debian.org/nagios3/
http://munin.debian.org/

>    approach three)  Write a fast and furious bash/perl/python script
> (it would be cool to use only packages of priority >= standard, or as
> few depends as possible) that takes a debian.org/infrastructure.yaml
> file (or .json or .txt or xml or ...) that defines Debian machines
> and services... the CLI client runs against that file (so it first
> diagnoses that the network connection to d.o is ok) and prints a
> report of unreachable services... (one run, one check, so not too
> much load unless a lot of users coordinate a DoS, which can be done
> with or without this tool).

I guess DSA would welcome a patch adding machine-parsable output and
status information to this:

https://db.debian.org/machines.cgi

I guess the devscripts maintainers would also welcome a script to read
the resulting info and print it out.
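For illustration only, here is a minimal sketch of what such a script
could look like. It assumes a hypothetical JSON file listing services
as objects with "name", "host" and "port" keys; no such file exists
today, and machines.cgi does not currently offer this output. The
function names are made up for the example, and a plain TCP connect is
only a rough stand-in for a real service check.

```python
import json
import socket


def check_service(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def load_services(text):
    """Parse the (hypothetical) JSON service list into Python objects."""
    return json.loads(text)


def unreachable(services, checker=check_service):
    """Return the service entries for which the checker reports failure."""
    return [s for s in services if not checker(s["host"], s["port"])]


def main(path):
    """Run one check per listed service and print the unreachable ones."""
    with open(path) as f:
        services = load_services(f.read())
    for s in unreachable(services):
        print(f"UNREACHABLE {s['name']} ({s['host']}:{s['port']})")
```

Calling main("infrastructure.json") would do one run with one check
per service, as in the proposal. The checker is passed in as a
parameter so that it can be swapped for something smarter (HTTP
status, DNS lookup) without touching the rest.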

>    approach four) Search for or write a distributed monitoring
> service that provides approach "one" or "two", but from different
> geographic locations, so that after detecting that a service/machine
> is down "from here", it communicates with monitoring systems on other
> continents and compares results before "validating" the issue.

Sounds like something that would be doable with nagios. I suggest you
send a patch for DSA's puppet configuration when alioth returns:

git://anonscm.debian.org/mirror/dsa-puppet.git (currently down due to
alioth being down)
http://dsa.debian.org/howto/puppet-setup/
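
The "contrast results before validating" step boils down to a quorum
decision. As a rough sketch, in plain Python rather than nagios
configuration, and with a made-up function name and vantage point
names, it could look like this. The strict-majority default is an
assumption, not something DSA uses as far as I know.

```python
def confirmed_down(reports, quorum=None):
    """Decide whether an outage is confirmed by enough vantage points.

    reports maps a vantage point name to True if the service was
    reachable from there, False otherwise. quorum is the number of
    failing vantage points required; the default is a strict majority.
    """
    if not reports:
        # No data from anywhere: do not claim an outage.
        return False
    failures = sum(1 for reachable in reports.values() if not reachable)
    if quorum is None:
        quorum = len(reports) // 2 + 1
    return failures >= quorum
```

With three vantage points, a failure seen from only one of them would
be treated as a local network problem and not reported.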

-- 
bye,
pabs

http://wiki.debian.org/PaulWise

