
Re: alioth is down (again)



On Sun, Jan 29, 2012 at 2:15 PM, Stephen Gran <sgran@debian.org> wrote:
> Hello,
>
> Unfortunately, one of the pair of machines providing the alioth service
> (vasks.debian.org) won't power on.  We are working on it, and apologize
> for any inconvenience caused.
>
> Cheers,


Hello,

  First, thanks from a plain user to all the people working to provide
infrastructure for the Debian project.

  Second, a suggestion, or rather a brain dump of ideas on how to
improve communication about issues:

  I imagine the scenario where some DD is trying to work from anywhere
in the world. Nowadays, when a service is not working, there are many
points to check... is it my last upgrade? my last config change? my
ISP? some intermediate ISP? or is the service really down? Of course,
this kind of email notification is just fine for announcing a known
issue.

   So I ask myself... is the reason not to run "any" public monitoring
system that it would increase the sysadmins' workload too much?  There
are different approaches to doing it...

   approach one)  Run a public nagios, monit, whatever, configured
with templates to notify this list on defined events (e.g. has the
service, the DNS, the whole machine, or the whole network been down
for more than 10 minutes? has the service recovered again?).

   approach two)  Search among the available open source monitoring
systems for one that can run a "status.debian.org", so that instead of
emails, users having an issue can look up such a dashboard and see
present and past status and issues.

    approach three)  Write a fast and furious bash/perl/python script
(it could be nice to use only packages of priority >= standard, or as
few dependencies as possible) that takes a debian.org/infrastructure.yaml
file (or .json or .txt or .xml or ...) defining Debian machines and
services... the CLI client runs against that file (so it first
diagnoses that the network connection to d.o is OK) and prints a
report of unreachable services... (one run, one check, so not much
extra load unless a lot of users synchronize a DoS, which can be done
with or without this tool).
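
    (A rough sketch of such a CLI client, to show what I have in mind.
The infrastructure file layout and its URL are completely invented; I
use .json here instead of .yaml only because the json module is in the
Python standard library, so the script needs no extra dependencies.)

    #!/usr/bin/env python3
    # Sketch of approach three.  Assumed (invented) file layout:
    #   { "machines": [
    #       { "name": "vasks.debian.org",
    #         "services": [ { "name": "ssh",   "port": 22 },
    #                       { "name": "https", "port": 443 } ] } ] }
    import json
    import socket
    import sys
    import urllib.request

    def tcp_check(host, port, timeout=10):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def main():
        # First make sure the network path to d.o itself works, otherwise
        # every later "unreachable" result would be meaningless.
        if not tcp_check("www.debian.org", 443):
            print("cannot reach www.debian.org -- check your own connection first")
            return 2

        # Placeholder URL: no such file exists today.
        url = "https://www.debian.org/infrastructure.json"
        with urllib.request.urlopen(url) as reply:
            infra = json.load(reply)

        failures = 0
        for machine in infra["machines"]:
            for service in machine["services"]:
                ok = tcp_check(machine["name"], service["port"])
                print("%-30s %-10s %s" % (machine["name"], service["name"],
                                          "ok" if ok else "UNREACHABLE"))
                failures += 0 if ok else 1
        return 1 if failures else 0

    if __name__ == "__main__":
        sys.exit(main())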

    approach four)  Search for or write a distributed monitoring
service that provides approach "one" or "two", but from different
geolocated places, so that after detecting that a service/machine is
down "from here", it contacts monitoring systems on other continents
and contrasts results before "validating" the issue.
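
    (And a tiny sketch of the "contrast before validating" step, just
for the quorum idea.  The peer monitor URLs and their reply format are
entirely made up: the local view counts as one vote, and the outage is
only "validated" when a majority of the vantage points that answered
agree.)

    #!/usr/bin/env python3
    # Sketch of the cross-check logic of approach four.  PEERS and the
    # /status/<host> reply format ({"reachable": true/false}) are invented.
    import json
    import urllib.request

    PEERS = [
        "https://monitor.example.net",   # hypothetical peer on another continent
        "https://monitor.example.org",   # hypothetical peer on another continent
    ]

    def peer_view(peer, host):
        try:
            with urllib.request.urlopen("%s/status/%s" % (peer, host),
                                        timeout=10) as reply:
                return bool(json.load(reply)["reachable"])
        except OSError:
            return None   # peer itself unreachable: no vote either way

    def validate_outage(host, seen_down_locally):
        votes_down = 1 if seen_down_locally else 0
        voters = 1
        for peer in PEERS:
            view = peer_view(peer, host)
            if view is not None:
                voters += 1
                votes_down += 0 if view else 1
        return votes_down * 2 > voters   # simple majority of answering voters

    if __name__ == "__main__":
        print(validate_outage("vasks.debian.org", seen_down_locally=True))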

    approach five)  ... surely people more clever than me can propose
better solutions to automate issue notification and tracking...
please do!

    This is not a big or even important "improvement front" for
Debian; these are just suggestions and ideas to improve the process
from my personal point of view, which of course may be plainly wrong
from outside the project.  I can help with details on the ideas, with
code if needed, and contribute my home ADSL to distributed monitoring
if needed, but I think my home connection fails more often than
Debian machines do.

    Again, thanks to all the people doing the work.


--
Iñigo

