Storage performance degradation
Incident Report for Flying Circus
We have finished the maintenance intended to prevent further network driver crashes on storage hosts.
Posted Nov 12, 2020 - 23:03 CET
We have identified the cause of the driver crash and will deploy countermeasures on the affected servers after 20:00 CET tonight. This requires rebooting multiple servers and updating a BIOS setting. It may lead to 4 periods of increased storage latency over the course of a few hours while we carry out this work.
Posted Nov 12, 2020 - 18:47 CET
We experienced a double fault in our storage layer. To protect data, IO was blocked until sufficient redundancy was restored.

The first fault was a crashed network driver on a storage server, which caused the data in the cluster to be redistributed. The second fault appears to be a bug in the storage software that caused one disk to enter a "ghost" state: it participated in the cluster but did not properly accept the redistributed data.

This did not affect all services equally. The main impact was on a subset of VMs on "HDD" storage that were affected by both the first and the second fault. Other services may have been affected through dependencies or temporary load spikes during recovery.

We are currently monitoring the situation and will work on protections against these faults in the future.
Posted Nov 12, 2020 - 17:59 CET
We have identified the issue and are working to solve the problem.
Posted Nov 12, 2020 - 17:37 CET
We are currently experiencing a performance degradation in our storage cluster. Customer services are affected by timeouts.
Posted Nov 12, 2020 - 17:12 CET
This incident affected: VM storage cluster.