Storage Outage
Incident Report for Flying Circus
Resolved
The cluster has been behaving normally since our last message and we consider the incident resolved.

We will review the incident at a later time to identify potential actions to avoid similar incidents in the future.

We're sorry for any inconvenience and are available through the usual channels if you have any questions.
Posted Jun 15, 2022 - 14:11 CEST
Monitoring
The cluster is fully operational again.

We experienced a sudden disk usage spike on a few disks that caused Ceph to enter a protective state to avoid data loss. We adjusted Ceph parameters to unblock the cluster, identified the task causing the high usage, and ensured that it finished without blocking the cluster again. We are currently cleaning up the surplus data that caused the spike to restore regular operational capacity.
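
For readers unfamiliar with this protective state, the following is a minimal sketch of how fill-level thresholds of this kind typically work. It is an illustration only and not our tooling: the function names are hypothetical, and the threshold values are Ceph's documented defaults, which are not necessarily the values in effect on the affected cluster.

```python
# Illustrative thresholds. These match Ceph's documented defaults, but the
# values actually configured on the affected cluster are an assumption here.
NEARFULL_RATIO = 0.85      # health warning only
BACKFILLFULL_RATIO = 0.90  # disk no longer accepts backfill data
FULL_RATIO = 0.95          # writes are blocked to protect existing data


def osd_state(used_bytes: int, total_bytes: int) -> str:
    """Classify a single disk (OSD) by its fill level (hypothetical helper)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= FULL_RATIO:
        return "full"
    if ratio >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if ratio >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


def cluster_blocks_writes(osds: dict[str, tuple[int, int]]) -> bool:
    """A single full disk is enough to put the cluster into the blocked state."""
    return any(osd_state(used, total) == "full" for used, total in osds.values())
```

This is why a spike on only a few disks can affect the whole cluster: the check is per disk, not on average utilization.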
Posted Jun 15, 2022 - 12:16 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 15, 2022 - 12:05 CEST
Update
We are continuing to investigate this issue.
Posted Jun 15, 2022 - 12:04 CEST
Investigating
Some parts of the storage cluster are currently not working properly. We are looking into it and will keep you posted.
Posted Jun 15, 2022 - 11:55 CEST
This incident affected: RZOB (production) (VM servers, VM storage cluster).