Partial storage server outage

Incident Report for Flying Circus

Resolved

Our fix has been rolled out over the last few hours and has been stable.

We have removed the workarounds from early this morning, and our telemetry shows no performance impact since their removal, which confirms our fix.

Please note: one of the workarounds was to disable processing of configuration management events. A number of events (such as changes to VM resources, maintenance mails, and some internal DNS records) backed up over the course of the day and were processed in the last 15 minutes.

Posted Apr 22, 2025 - 17:44 CEST

Update

We have implemented a long-term fix for the issue from early this morning and are preparing to roll it out in our production cluster over the next few hours.

This will happen in a staggered fashion, host by host, and will be transparent for our customers. We will disable the short-term workarounds that we implemented earlier today once the rollout shows that our long-term fix is holding up.

Posted Apr 22, 2025 - 12:15 CEST

Update

We are implementing and testing a structural fix for the storage management software component.

Posted Apr 22, 2025 - 06:15 CEST

Update

We are continuing to work on a fix for this issue.

Posted Apr 22, 2025 - 06:14 CEST

Update

All VMs are back online. We are verifying the individual services now.

Posted Apr 22, 2025 - 06:13 CEST

Update

There are still some VMs offline. We are working on it.

Posted Apr 22, 2025 - 05:53 CEST

Update

The storage server traffic is back to normal. We are now looking at the affected VMs and services.

Posted Apr 22, 2025 - 05:04 CEST

Update

We are still in the process of preventing the calls.

Posted Apr 22, 2025 - 04:53 CEST

Identified

An expensive metadata call seems to be the issue. We are about to prevent the specific call.

Posted Apr 22, 2025 - 04:28 CEST

Update

We are continuing to investigate this issue.

Posted Apr 22, 2025 - 04:10 CEST

Investigating

A single OSD ("Disk") shows unusual data transfer behaviour, causing VMs that use this OSD to slow down significantly.

Posted Apr 22, 2025 - 03:59 CEST

This incident affected: RZOB (production) (VM storage cluster).