Rebooting and unresponsive VMs

Incident Report for Flying Circus

Postmortem

Last Wednesday (2017-08-02) we experienced a series of unresponsive or rebooting VMs due to an excessive number of simultaneous live migrations in our virtualisation infrastructure.

Our analysis showed that the events leading to this began the evening before, when we added a new public IPv4 network to our infrastructure.

This action is done in our inventory system and propagates to our machines within a few minutes. We watched our systems closely in the hour after the change but did not see any adverse effects.

However, our networking infrastructure uses "policy routing" to ensure LAN-level performance when machines that talk to each other use different IP networks on the same LAN segment. Using different networks is a result of global IPv4 exhaustion, which requires us to pack public IPs as compactly as possible and to use multiple smaller networks that cannot be aggregated into larger ones when new addresses are needed. Services that grow over time may thus use IPs from multiple generations of networks. Without policy routing, machines in different networks could only talk to each other by passing their traffic through our router. This, however, carries a small performance penalty that some applications had issues with a few years ago.

Since then we have already switched to using a large private IPv4 network instead of public IPv4 addresses for new machines which eliminates this issue for the foreseeable future. Also, using IPv6 for talking within the data center eliminates this for a long time.

Nevertheless, the existing policy routing infrastructure requires all machines (physical and virtual) to be informed about all networks that exist in our environment. Changing this configuration (by adding a new network) causes a maintenance job to be scheduled for every machine (at a convenient time) because a service interruption of a few seconds is required to reconfigure the network interfaces.
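The effect of this requirement can be sketched in a few lines of Python. This is an illustration only, not our actual configuration: the networks are taken from the reserved documentation ranges, and the function `next_hop` is a hypothetical stand-in for the routing decision a host makes.

```python
import ipaddress

# Hypothetical networks sharing one LAN segment (documentation ranges,
# not our real address space).
KNOWN_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/25"),
    ipaddress.ip_network("198.51.100.0/26"),
]


def next_hop(destination, known_networks):
    """Return 'direct' if the host knows the destination is on its LAN
    segment, otherwise 'router' (the slower detour)."""
    address = ipaddress.ip_address(destination)
    for network in known_networks:
        if address in network:
            return "direct"
    return "router"


# A newly added network is unknown to the host until it is reconfigured.
NEW_NETWORK = ipaddress.ip_network("203.0.113.0/27")

print(next_hop("198.51.100.10", KNOWN_NETWORKS))
print(next_hop("203.0.113.5", KNOWN_NETWORKS))
print(next_hop("203.0.113.5", KNOWN_NETWORKS + [NEW_NETWORK]))
```

Until a host has learned about the new network, traffic to it takes the detour through the router. This is why adding a single network touches the configuration of every machine.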

This reconfiguration is automatically scheduled for a short window of 1 minute. In this incident, our KVM servers scheduled their maintenance windows for the next morning. However, KVM servers first need time to evacuate all VMs before performing potentially disruptive maintenance. For large and busy VMs this can take hours. The schedule, however, planned for all virtualisation servers to perform their maintenance consecutively within 2 hours. This resulted in too many KVM servers trying to evacuate at the same time. In general we protect against this, but we are currently working on a race condition in which virtual machines can claim memory on their new target host before that host has migrated away other machines that should no longer be using this memory.
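The gap between the planned and the actual load can be illustrated with a small simulation. The numbers are made up for illustration (12 hosts, windows 10 minutes apart, 90 minutes of evacuation per host), and the sketch assumes that a host begins evacuating at the start of its scheduled window; neither the figures nor the function names come from our scheduler.

```python
def max_concurrent(intervals):
    """Largest number of half-open intervals [start, end) that overlap."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times, process ends (-1) before starts (+1) so that
    # intervals which merely touch are not counted as overlapping.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


WINDOW_MINUTES = 1        # the reconfiguration itself
EVACUATION_MINUTES = 90   # assumed time to move all VMs away
starts = [i * 10 for i in range(12)]  # 12 hosts spread over 2 hours

# What the schedule assumed: each host is busy for one minute.
planned = [(s, s + WINDOW_MINUTES) for s in starts]
# What happened: each host is busy for evacuation plus reconfiguration.
actual = [(s, s + EVACUATION_MINUTES + WINDOW_MINUTES) for s in starts]

print("planned peak:", max_concurrent(planned))
print("actual peak:", max_concurrent(actual))
```

With these assumed numbers the plan never has more than one host in maintenance at a time, while in reality ten of the twelve hosts are evacuating simultaneously at the peak.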

We already knew about this issue and had therefore stopped performing code releases that require large-scale maintenance without close manual supervision. This time we did not anticipate this behaviour, as ordinary changes in our inventory do not have such widespread effects.

The situation was quickly stabilized by stopping our automation and manually cleaning up all virtualization hosts over the day.

After reviewing this scenario we have decided on the following platform improvements:

Virtualization server maintenance scripts will receive a "safety seatbelt" that completely prohibits automated maintenance scheduling without manual intervention. This avoids the escalating buildup of issues.
Our maintenance duration prediction algorithm will include the evacuation time of KVM hosts, so that parallel load is reduced wherever it can be adequately predicted.
We will remove our policy routing infrastructure as we now have more modern approaches available that do not require it. This reduces overall network complexity and the need for global changes when adding or removing networks, making such changes more robust.
We had originally not planned to upgrade all our existing virtualization servers to 10 Gbit Ethernet and intended to let this grow organically whenever new machines are added. We are going to review this decision to support faster migrations that have less impact on running services and thus also reduce the impact of such large-scale maintenance periods.
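To make the first, second and fourth items more concrete, here is a rough sketch of how they could fit together. It is not our implementation: the `Host` fields, the `efficiency` factor, the `approved` flag and the example sizes are assumptions for illustration, and the estimate ignores memory pages that have to be transferred again because a busy VM changed them during migration.

```python
from dataclasses import dataclass

RECONFIGURE_SECONDS = 60  # the 1-minute reconfiguration window


@dataclass
class Host:
    name: str
    vm_memory_gib: float  # memory of all VMs that must be moved away
    link_gbit_s: float    # nominal speed of the migration link


def predicted_duration(host, efficiency=0.8):
    """Estimate maintenance time in seconds: evacuation plus reconfiguration.

    `efficiency` is an assumed fraction of the link speed that is
    actually usable for migration traffic.
    """
    bits_to_move = host.vm_memory_gib * 2**30 * 8
    evacuation = bits_to_move / (host.link_gbit_s * 1e9 * efficiency)
    return evacuation + RECONFIGURE_SECONDS


def schedule(hosts, start=0.0, approved=False):
    """Plan hosts strictly one after another, using predicted durations.

    The 'seatbelt': nothing is scheduled unless a human approved it.
    """
    if not approved:
        raise PermissionError("maintenance requires manual approval")
    plan = []
    t = start
    for host in hosts:
        plan.append((host.name, t))
        t += predicted_duration(host)
    return plan


slow = Host("kvm-a", vm_memory_gib=256, link_gbit_s=1)
fast = Host("kvm-b", vm_memory_gib=256, link_gbit_s=10)
for host in (slow, fast):
    print(host.name, round(predicted_duration(host) / 60, 1), "minutes")
print(schedule([slow, fast], approved=True))
```

Under these assumptions the evacuation part of the estimate shrinks by a factor of ten on the faster link, and each host starts only after the previous one is predicted to be finished.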

We apologize for the inconvenience and are grateful for your continued trust in us to deliver a highly reliable platform and service.

As always, we welcome your feedback: send me an email at ct@flyingcircus.io if you want to get in touch.

Christian Theune CEO / Founder

Posted Aug 08, 2017 - 13:10 CEST

Resolved

All KVM hosts have successfully applied their maintenance. A single VM experienced a reboot during the migrations last night, but otherwise everything went smoothly. We have rebalanced VMs in the cluster this morning for even load distribution and are back to normal.

We're sorry for any inconvenience and will provide a postmortem in the next few days.

Posted Aug 03, 2017 - 12:09 CEST

Monitoring

We're sorry for the long time since the last update. We stabilized the situation when we noticed the issue this morning. At the moment we are manually performing the required maintenance for all KVM hosts while ensuring that no accidental overload happens again. We are about halfway through and see regular behaviour on all systems. We will update the issue once we're completely done.

Posted Aug 02, 2017 - 14:45 CEST

Identified

Today several VMs rebooted unexpectedly or were unresponsive.

Several VM hosts started maintenance at the same time. This caused the remaining hosts to be overloaded. We are reviewing the situation now. Some VMs will be slower than usual.

Posted Aug 02, 2017 - 09:22 CEST