VM performance and stability issues

Incident Report for Flying Circus

Resolved

All global and individual VM issues have been resolved and all customer services are back to normal.

Some VMs needed additional attention because we saw new error modes that we currently attribute to the temporary extreme overload of the virtualization servers. In those cases, VMs were stuck for an extended period and needed a manual review, a restart and, in very few cases, manual fixes from us.

Sorry for the inconvenience. The initial overload was caused by a defect in our VM balancing algorithm, which we will fix as soon as possible.

Posted Sep 16, 2015 - 01:19 CEST

Update

We managed to redistribute the VMs after the unclean rebalancing and are monitoring the situation until it fully settles.

Some VMs have been restarted, either by watchdog interruption due to unresponsiveness, or as an emergency measure from our side (if they were non-production VMs). Some of those are currently still rebooting as they are performing intensive disk checks.

Posted Sep 15, 2015 - 23:38 CEST

Identified

We have noticed stability and performance issues in our virtualization layer. It appears that a number of automatic live migrations caused a pathologically uneven distribution of VMs on our hosts. This led to massive swapping, which in turn is causing slowness and stability issues.

We are currently rebalancing the VMs to get back to a stable situation and are working on a bugfix for the balancing algorithm to prevent this from happening again.

Posted Sep 15, 2015 - 22:37 CEST