We have received feedback from our data center operator about the outage.
One practical conclusion on our side is that customer setups not yet using IPv6 can further improve their resiliency in this specific situation. We will contact customers without dual-stack setups (running IPv4 and IPv6 concurrently on the frontends) to implement dual stack where possible.
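As a rough way to check whether a frontend hostname is already reachable over both protocols, the following minimal Python sketch looks at which address families the name resolves to. The function names `address_families` and `is_dual_stack` and the default port 443 are assumptions made for illustration, not part of our tooling. The result only reflects what the local resolver returns and does not prove that the service actually answers on both protocols.

```python
import socket


def address_families(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the set of IP versions ("IPv4", "IPv6") the hostname resolves to.

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    function can be exercised without network access.
    """
    families = set()
    for family, _type, _proto, _canon, _sockaddr in resolver(
        hostname, port, proto=socket.IPPROTO_TCP
    ):
        if family == socket.AF_INET:
            families.add("IPv4")
        elif family == socket.AF_INET6:
            families.add("IPv6")
    return families


def is_dual_stack(hostname, port=443, resolver=socket.getaddrinfo):
    """True if the hostname resolves to both an IPv4 and an IPv6 address."""
    return address_families(hostname, port, resolver) == {"IPv4", "IPv6"}
```

Calling `is_dual_stack("www.example.com")` with a real frontend hostname in place of the placeholder would report whether both A and AAAA records are visible from the machine running the check.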
Even though this outage was only partial, it affected our customers for roughly an hour, which is longer than we strive for at this level of infrastructure. Our data center SLAs target 99.99% availability, which on average allows for about 4 minutes of downtime per month or 53 minutes per year. This outage exceeded that budget.
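The downtime figures above follow directly from the availability target. As a small sketch of the arithmetic, assuming a 365-day year and an average month of one twelfth of a year (the function name `downtime_budget_minutes` is chosen here for illustration):

```python
HOURS_PER_YEAR = 365 * 24             # assumes a non-leap year
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # average month length


def downtime_budget_minutes(availability, period_hours):
    """Minutes of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_hours * 60.0


monthly_budget = downtime_budget_minutes(0.9999, HOURS_PER_MONTH)
yearly_budget = downtime_budget_minutes(0.9999, HOURS_PER_YEAR)

print(f"monthly budget: {monthly_budget:.2f} min")  # about 4.38
print(f"yearly budget:  {yearly_budget:.2f} min")   # about 52.56
```

An outage of roughly 60 minutes therefore uses up more than the full yearly budget at 99.99%.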
We are generally in close contact with data center personnel during critical situations like this. Initially, however, we had a hard time reaching them because their phone system was affected as well. We have since been provided with a separate backup number for future incidents. Additionally, we are still discussing options to improve the “time to recovery” in future situations like this, to stay true to the goal of 99.99% availability at the data center infrastructure level.