Storage and VM outage
Incident Report for Flying Circus
Resolved
The storage has been stable for several hours now and we have fully re-enabled the backup system.

We apologize for the problems we have caused you. We will publish a detailed analysis of the incident in a few days.
Posted Nov 23, 2016 - 13:11 CET
Update
Deleting the snapshots is taking considerably longer than expected. We will turn backups back on in the morning.
Posted Nov 23, 2016 - 01:22 CET
Update
So far we have applied the bugfix successfully. The storage cluster is now deleting the pending snapshots. Once all snapshots marked for deletion are gone, we will re-enable the backup system.
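For illustration, the kind of cleanup the cluster is working through looks roughly like the sketch below. This is not our exact tooling, and the pool name is a placeholder.

    import subprocess

    POOL = "rbd.hdd"  # placeholder pool name, not our actual configuration

    def rbd(*args):
        """Run an rbd CLI command and return its output."""
        return subprocess.check_output(("rbd",) + args, text=True)

    # Purge (i.e. schedule for deletion) all snapshots of every image in the pool.
    # The cluster then trims the snapshot data asynchronously in the background.
    for image in rbd("ls", POOL).split():
        print(f"purging snapshots of {POOL}/{image}")
        rbd("snap", "purge", f"{POOL}/{image}")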
Posted Nov 22, 2016 - 22:36 CET
Update
We have developed a bugfix for the current Ceph bug and carefully reproduced the situation in our development cluster. There, we were able to fix the existing issues reliably and without any service interruption.

Tonight at 20:30 CET we will apply the bugfix to our production cluster and re-enable backups after that.
Posted Nov 22, 2016 - 16:23 CET
Update
Yesterday we most likely found a way around the Ceph bug. We are in contact with a consultant to review the situation.
Posted Nov 21, 2016 - 09:35 CET
Update
Due to the Ceph bug preventing proper management of snapshots, we have to suspend backups temporarily. We will continue investigating a fix tomorrow morning, which may include moving and copying VM data into a new Ceph pool to get rid of the bug's impact.
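Purely for illustration, the "new pool" option would boil down to something like the following sketch. The pool names are placeholders and we have not committed to this route.

    import subprocess

    SRC_POOL = "rbd.hdd"      # placeholder: current, bug-affected pool
    DST_POOL = "rbd.hdd.new"  # placeholder: freshly created pool

    images = subprocess.check_output(["rbd", "ls", SRC_POOL], text=True).split()
    for image in images:
        # Full copy of the image data; snapshots are not carried over by "rbd cp".
        subprocess.check_call(["rbd", "cp", f"{SRC_POOL}/{image}", f"{DST_POOL}/{image}"])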
Posted Nov 20, 2016 - 02:33 CET
Update
The earlier blocked IOs were caused by daemons crashing due to a Ceph bug, discovered in our specific situation, that stops snapshots from being deleted correctly. We have stabilized the situation for now but are still investigating a workaround to safely re-enable snapshot deletion.
Posted Nov 20, 2016 - 01:31 CET
Update
We are occasionally seeing blocked IO requests leading to temporary VM freezes.
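Such blocked requests show up in Ceph's health output; the sketch below illustrates the kind of check involved. It is not our actual monitoring, and the exact wording of the health messages varies by Ceph release.

    import subprocess
    import time

    def blocked_request_lines():
        # "ceph health detail" reports lines such as
        # "... requests are blocked > 32 sec" when IO is stuck.
        out = subprocess.check_output(["ceph", "health", "detail"], text=True)
        return [line for line in out.splitlines() if "blocked" in line]

    while True:
        for line in blocked_request_lines():
            print(line)
        time.sleep(30)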
Posted Nov 19, 2016 - 16:24 CET
Monitoring
We have started all VMs, with a few exceptions that we need to take a deeper look at. The services have been generally available for a few hours now.
Posted Nov 19, 2016 - 13:20 CET
Update
We are now bulk-starting the remaining VMs and keeping an eye on service recovery.
Posted Nov 19, 2016 - 11:36 CET
Update
We have finished resolving missing objects and other Ceph-specific issues. A first set of carefully selected VMs has been started, and a deep filesystem check on them found no issues.

We are now starting to slowly increase production load. A few VMs that are missing one or two blocks will need to be repaired; we were able to retrieve those blocks manually from Ceph.
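For reference, the deep filesystem check mentioned above boils down to mapping a VM's image read-only on a host and running a forced, non-destructive check against it. The sketch below uses placeholder names and assumes an image holding a bare ext4 filesystem without a partition table.

    import subprocess

    IMAGE = "rbd.hdd/example-vm.root"  # placeholder pool/image name

    # Map the image read-only; "rbd map" prints the resulting block device path.
    dev = subprocess.check_output(
        ["rbd", "map", "--read-only", IMAGE], text=True).strip()
    try:
        # -f forces a full check even if the filesystem looks clean,
        # -n reports problems without changing anything.
        subprocess.run(["e2fsck", "-f", "-n", dev], check=False)
    finally:
        subprocess.check_call(["rbd", "unmap", dev])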
Posted Nov 19, 2016 - 09:14 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 19, 2016 - 06:30 CET
Update
We have very carefully analysed the lower-level filesystems and found most data to be intact. We are currently trying to recover or delete a small number of missing objects that are blocking us from checking the VMs' data and starting them again. The first few VMs from our SSD pool, which was not missing any objects, appeared mostly unharmed and started up fine.
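For context, such missing ("unfound") objects are reported per placement group and can be listed, and only as a last resort reverted or deleted. A rough sketch with a placeholder PG id follows; exact command names vary by Ceph release.

    import json
    import subprocess

    pgid = "4.2af"  # placeholder placement group id taken from "ceph health detail"

    out = subprocess.check_output(
        ["ceph", "pg", pgid, "list_missing", "--format", "json"], text=True)
    print(json.dumps(json.loads(out), indent=2))

    # Destructive last resort, only once all recovery attempts are exhausted:
    # subprocess.check_call(["ceph", "pg", pgid, "mark_unfound_lost", "revert"])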

Based on the more detailed data we gathered while repairing the filesystems, we now consider it more likely that the inconsistencies were triggered by earlier bugs, possibly related to disabling a controller caching feature a while ago. Under those circumstances we decided to place more trust in the newer long-term support kernel. We double-checked the filesystems on both the old and the new kernel and carefully fixed all inconsistencies.

We will update you again once we get the situation with the few missing objects resolved and start VMs up again.
Posted Nov 19, 2016 - 06:30 CET
Update
We suspect a kernel bug that has not been detected in any of our testing environments.

We have decided on a route that tries to recover the existing VM data first. If that doesn't work, we will fall back to restoring from backups later.

Our current plan is to:

* take offline all currently running VMs
* turn off all Ceph daemons
* switch all servers back to the previous kernel version
* perform XFS filesystem check and recovery on all Ceph data partitions
* start Ceph daemons up again
* start individual VMs to check for their filesystem consistency
* monitor the system for stability

After that we will decide whether stability and consistency are satisfactory and either move forward or turn to restoring from backups.
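As a rough illustration of the filesystem check and recovery step in the plan above, a per-host sketch might look like the following. OSD ids, devices and mount points are placeholders, the kernel rollback and reboot are left out, and our actual runbook contains more safeguards.

    import subprocess

    OSD_PARTITIONS = {
        "0": "/dev/sdb1",   # placeholder: OSD id -> Ceph data partition
        "1": "/dev/sdc1",
    }

    for osd_id, device in OSD_PARTITIONS.items():
        subprocess.check_call(["systemctl", "stop", f"ceph-osd@{osd_id}"])
        subprocess.check_call(["umount", device])
        # Dry run: report inconsistencies without modifying anything.
        subprocess.run(["xfs_repair", "-n", device], check=False)
        # Actual repair on the unmounted filesystem.
        subprocess.check_call(["xfs_repair", device])
        subprocess.check_call(["mount", device, f"/var/lib/ceph/osd/ceph-{osd_id}"])
        subprocess.check_call(["systemctl", "start", f"ceph-osd@{osd_id}"])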
Posted Nov 19, 2016 - 01:00 CET
Update
The underlying filesystems of our Ceph cluster are showing major corruption. We are currently investigating our options for repair or whether to restore from backup.
Posted Nov 18, 2016 - 23:56 CET
Investigating
After the kernel updates, our storage servers started normally but have since begun to crash with filesystem errors. We are investigating the issue at the moment, taking care not to destroy any data.
Posted Nov 18, 2016 - 23:07 CET
Monitoring
Our storage and VM servers are applying their kernel updates and are being rebooted. This causes intermittent periods of reduced performance.
Posted Nov 18, 2016 - 20:08 CET