Storage performance degradation

Incident Report for Flying Circus

Resolved

We have finished the maintenance intended to prevent further network driver crashes on storage hosts.

Posted Nov 12, 2020 - 23:03 CET

Update

We have identified the cause of the driver crash and will deploy countermeasures on the affected servers after 20:00 CET tonight. This requires rebooting multiple servers and updating a BIOS setting. While we do so, there may be 4 periods of increased storage latency spread over a few hours.

Posted Nov 12, 2020 - 18:47 CET

Monitoring

We experienced a double fault in our storage layer. To protect data, IO was blocked until sufficient redundancy had been restored.

The first, larger issue was a crashed network driver on a storage server, which caused the data in the cluster to be redistributed. The second fault appears to be a bug in the storage software that caused one disk to enter a "ghost" state: it participated in the cluster but did not properly accept the redistributed data.

This did not affect all services equally. It mainly affected a subset of VMs on "HDD" storage that were hit by both the first and the second fault. Other services may have been affected through dependencies or temporary load spikes during recovery.

We are currently monitoring the situation and will work on protections against these faults at a later point.

Posted Nov 12, 2020 - 17:59 CET

Identified

We have identified the issue and are working on a solution.

Posted Nov 12, 2020 - 17:37 CET

Investigating

We are currently experiencing a performance degradation in our storage cluster. Customer services are affected and we are seeing them time out.

Posted Nov 12, 2020 - 17:12 CET

This incident affected: RZOB (production) (VM storage cluster).