Storage Outage

Incident Report for Flying Circus

Resolved

The cluster has been behaving normally since our last message and we consider the incident resolved.

We will review the incident at a later time to identify potential actions to avoid similar incidents in the future.

We're sorry for any inconvenience and are available through the usual channels if you have any questions.

Posted Jun 15, 2022 - 14:11 CEST

Monitoring

The cluster is fully operational again.

We experienced a sudden disk usage spike on a few disks that caused Ceph to enter a protective state to avoid data loss. We adjusted Ceph parameters to unblock the cluster, identified the task causing the high usage, and ensured that it finished without blocking the cluster again. We are currently cleaning up the surplus data that caused the spike to restore regular operational capacity.
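
For readers unfamiliar with this behaviour, the following is a minimal sketch of how usage thresholds can push a cluster into a write-blocking state when only a few disks fill up. It assumes the protective state was triggered by Ceph's per-OSD fullness ratios and uses Ceph's documented default values (0.85, 0.90, 0.95); the actual settings on our cluster and the parameters we adjusted are not stated here. The function names are hypothetical and exist only for illustration.

```python
# Assumption: Ceph's documented default ratios, not the values used on the
# affected cluster (those are not given in this report).
NEARFULL_RATIO = 0.85      # health warning
BACKFILLFULL_RATIO = 0.90  # backfill onto this disk is refused
FULL_RATIO = 0.95          # writes are blocked to protect existing data


def osd_state(used_bytes, total_bytes):
    """Classify a single disk (OSD) by its fill level. Hypothetical helper."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    utilization = used_bytes / total_bytes
    if utilization >= FULL_RATIO:
        return "full"
    if utilization >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if utilization >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


def cluster_blocks_writes(osds):
    """Return True if any disk is full.

    `osds` is an iterable of (used_bytes, total_bytes) pairs. A single full
    disk is enough, which is why a spike on a few disks can affect the
    whole cluster.
    """
    return any(osd_state(used, total) == "full" for used, total in osds)
```

In this simplified model, one disk at 96% blocks writes even if all other disks are half empty, which matches the pattern described above: a spike on a few disks, a cluster-wide effect.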

Posted Jun 15, 2022 - 12:16 CEST

Identified

The issue has been identified and a fix is being implemented.

Posted Jun 15, 2022 - 12:05 CEST

Update

We are continuing to investigate this issue.

Posted Jun 15, 2022 - 12:04 CEST

Investigating

Some parts of the storage cluster are currently not working properly. We are looking into it and will keep you posted.

Posted Jun 15, 2022 - 11:55 CEST

This incident affected: RZOB (production) (VM servers, VM storage cluster).