Storage performance degradation
Incident Report for Flying Circus
Resolved
We have completed the maintenance to prevent further network driver crashes on storage hosts.
Posted Nov 12, 2020 - 23:03 CET
Update
We have identified the cause of the driver crash and will deploy countermeasures on the affected servers after 20:00 CET tonight. This requires changing a BIOS setting and rebooting multiple servers. It may lead to four periods of increased storage latency over the course of a few hours.
Posted Nov 12, 2020 - 18:47 CET
Monitoring
We experienced a double fault in our storage layer. To protect data, IO was blocked until sufficient redundancy was restored.

The first fault was a crashed network driver on a storage server, which caused the data in the cluster to be redistributed. The second fault appears to be a bug in the storage software that caused one disk to enter a "ghost" state: it participated in the cluster but did not properly accept the redistributed data.

Not all services were affected equally. The main impact was on a subset of VMs on "HDD" storage that were hit by both the first and the second fault. Other services may have been affected through dependencies or temporary load spikes during recovery.

We are currently monitoring the situation and will work on protections against these faults at a later point.
Posted Nov 12, 2020 - 17:59 CET
Identified
We have identified the issue and are working on a fix.
Posted Nov 12, 2020 - 17:37 CET
Investigating
We are currently experiencing a performance degradation in our storage cluster. We see affected customer services timing out.
Posted Nov 12, 2020 - 17:12 CET
This incident affected: RZOB (production) (VM storage cluster).