From the sometimes-we-get-stuff-done-before-coffee department:

Executive summary: Things are back.

In the UK, Bytemark hosts a full center of gear for Debian to run a ganeti cluster, which in turn runs a significant portion of Debian's infrastructure [ldapsearch]. Two of the blades also act as routers, with routing duties configured in a failover manner.

Yesterday we upgraded the machines that host the ganeti cluster at Bytemark to stretch, and things seemed to be going mostly straightforwardly. There was a minor issue with keepalived, the software ensuring that one of the two routing blades actually holds the gateway IP addresses, but we thought we had resolved it quickly. The new keepalived in stretch appears to no longer handle both IPv4 and IPv6 addresses in the same VRRP instance, so we simply split the one instance into two, and that appeared to work.

Unfortunately, we failed to also update the script that manages default routes so that it accepts the new names of the VRRP instances.

Overnight, we also did a scheduled mass reboot of all the blades to adapt some kernel parameters. This is a scripted procedure, running for several hours, which moves virtual instances around and reboots blades once they are empty. All went well, with the minor exception of the active routing instance, which rebooted at about 0200Z, came back, activated its gateway addresses, and then failed to set its default gateway.

As a result, technically, all services were still reachable; they just couldn't get their answers out. :)

The script has been updated, and all services were restored shortly after 0600Z.

We would also like to thank Bytemark and their team, both for sponsoring us in the first place and for helping us locate the issue quickly.

Cheers,

[ldapsearch] ldapsearch -h db.debian.org -x -ZZ -b dc=debian,dc=org -LLL 'physicalHost=ganeti.bm.debian.org' hostname purpose

--
bye,
pabs

https://wiki.debian.org/PaulWise