Outage

Incident Report for Flying Circus

Postmortem

We have performed an in-depth analysis of the data loss and extended recovery time last week.

You can get the detailed post-mortem from our website: http://flyingcircus.io/postmortems/13266.pdf

Posted Mar 19, 2014 - 15:35 CET

Resolved

All customer VMs have been restored as of about 10:00 CET.

Some internal machines are still being restored, but overall service level is back to normal.

We apologize for this extreme event. We are starting an extensive review of the processes and tools that caused this outage and last week's. We already have a couple of ideas on our list and will update this incident with more in-depth information later.

Posted Mar 11, 2014 - 13:09 CET

Update

Almost all production VMs have been restored as of around 01:00 CET.

We're currently struggling with a backup inconsistency and have to dig deeper to get a good restore, but overall things are looking OK.

Posted Mar 11, 2014 - 05:47 CET

Update

Our mail server is back. Communication with our support team is working again.

Posted Mar 10, 2014 - 19:57 CET

Update

Restore is continuing smoothly.

The actual share of lost disks was probably around 50%, maybe even less. Despite our initial worst-case estimate, many services survived the interruption without any issues.

Posted Mar 10, 2014 - 17:53 CET

Update

Our restore processes have started and the VMs are recovering. We are initially going to restore central services (mail) and customers with SLA requirements.

We also noticed that not all VMs were affected by the bug - likely because a timeout prevented some of the massive list of deletions from being executed.

Posted Mar 10, 2014 - 16:37 CET

Update

The issue also currently limits our ability to answer email. We are starting to contact all our customers directly.

Before starting the restore process, we are removing the bug that caused the outage to avoid any further incidents.

Posted Mar 10, 2014 - 15:20 CET

Identified

We have identified the cause of the current outage. We are experiencing massive data loss after a bug in our management code was triggered when reactivating one of the old storage servers in a foreign location.

We are currently taking inventory of the damage and preparing for disaster recovery and reinstallation. We'll follow up with a more detailed plan shortly.

Posted Mar 10, 2014 - 15:03 CET

Investigating

We're seeing disk errors on multiple VMs causing overall service outages. We're currently looking into the issue.

Posted Mar 10, 2014 - 14:48 CET