Postmortem analysis of Debian at Bytemark being down

From the sometimes-we-get-stuff-done-before-coffee department:

Executive summary: Things are back.

In the UK, Bytemark hosts a full blade center of gear for Debian to run a
ganeti cluster which in turn runs a significant portion of Debian's
infrastructure [ldapsearch].  Two of the blades also act as routers,
with routing duties being configured in a failover manner.

We upgraded the machines that host the ganeti cluster at Bytemark to
stretch yesterday, and the upgrade seemed mostly straightforward.
There was a minor issue with keepalived, the software that ensures one
of the two routing blades actually holds the gateway IP addresses, but
we thought we had resolved it quickly.

The new keepalived in stretch appears to no longer handle both IPv4 and
IPv6 addresses in the same VRRP instance, so we simply split the one
instance into two, and that appeared to work.  Unfortunately, we failed
to also update a script that manages default routes so that it accepts
the new names of the VRRP instances.
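
To illustrate the failure mode, here is a minimal sketch of a keepalived
notify handler that keys on the VRRP instance name.  It is not our
actual script: the instance names "gateway-v4" and "gateway-v6" and the
next-hop addresses are hypothetical placeholders.  The sketch only
prints the command it would run.  The point is the final catch-all
branch: a script without it silently does nothing when an instance is
renamed.

```shell
#!/bin/sh
# Hypothetical notify handler. keepalived invokes notify scripts as:
#   <script> <GROUP|INSTANCE> <name> <MASTER|BACKUP|FAULT> <priority>
# Instance names and addresses below are assumptions for illustration.

GW4="192.0.2.1"      # placeholder next hop (documentation range)
GW6="2001:db8::1"    # placeholder next hop (documentation range)

# Print the command that would be run (dry run), keyed on name and state.
route_action() {
    name="$2"
    state="$3"
    case "$name:$state" in
        gateway-v4:MASTER) echo "ip -4 route replace default via $GW4" ;;
        gateway-v6:MASTER) echo "ip -6 route replace default via $GW6" ;;
        gateway-v4:*|gateway-v6:*) echo "noop" ;;
        *)
            # Fail loudly on names this script does not know about,
            # instead of silently skipping the route setup.
            echo "unknown VRRP instance: $name" >&2
            return 1
            ;;
    esac
}
```

A real handler would execute the command rather than print it, and
would be called with the arguments keepalived passes, for example
route_action "$@".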

Overnight, we also did a scheduled mass reboot of all the blades to
adjust some kernel parameters.  This is a scripted procedure that runs
for several hours, moving virtual instances around and rebooting blades
once they are empty.  All went well, with the minor exception of the
active routing instance, which rebooted at about 0200Z, came back,
activated its gateway addresses, and then failed to set its default
gateway.  As a result, technically, all services were still reachable;
they just couldn't get their answers out. :)
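
A post-reboot sanity check for this condition could look roughly like
the sketch below.  It is an illustration, not something we run: the
helper name has_default_route is made up, and it assumes the usual
"default via ..." line format printed by "ip route show default".

```shell
# has_default_route: reads route listing text on stdin and succeeds
# if at least one line describes a default route.
has_default_route() {
    grep -q '^default '
}
```

On a router it could be used as:

    ip -4 route show default | has_default_route || echo "no IPv4 default route" >&2
    ip -6 route show default | has_default_route || echo "no IPv6 default route" >&2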

The script has been updated, and all services restored as of shortly
after 0600Z.

We would also like to thank Bytemark and their team, both for
sponsoring us in the first place and for helping us locate the
issue quickly.


[ldapsearch] ldapsearch -h db.debian.org -x -ZZ -b dc=debian,dc=org -LLL 'physicalHost=ganeti.bm.debian.org' hostname purpose



