Storage cluster issues in RZOB

Incident Report for Flying Circus

Resolved

This incident has been resolved.
Posted Aug 05, 2025 - 19:35 CEST

Update

We found a stuck storage daemon that Ceph had not properly removed from the cluster. Deactivating the stuck daemon unblocked traffic immediately.

The direct impact was limited to VMs running in the SSD-class pool. Because those VMs are typically used for databases, the impact was more widespread, especially for complex services that rely on clusters of VMs, even if only one of their VMs was affected.
Posted Aug 05, 2025 - 19:35 CEST

Monitoring

We've identified the issue and have taken action to resolve the problem. We're continuing to assess the situation.
Posted Aug 05, 2025 - 19:27 CEST

Investigating

We're seeing issues related to the Ceph storage cluster in RZOB.
Posted Aug 05, 2025 - 19:22 CEST
This incident affected: RZOB (production) (VM storage cluster).