Resolved -
Our fix has been rolled out over the last few hours and has remained stable.
We have removed the workarounds we put in place early this morning, and our telemetry shows no performance impact since then, confirming the fix.
Please note: one of the workarounds was to disable processing of configuration management events. A number of events (such as changes to VM resources, maintenance mails, and some internal DNS records) backed up over the day and have been processed within the last 15 minutes.
Apr 22, 17:44 CEST
Update -
We have implemented a long-term fix for tonight's issue and are preparing to roll it out to our production cluster over the next few hours.
This will happen in a staggered fashion, host by host, and will be transparent for our customers. We will remove the short-term workarounds that we implemented earlier today once the rollout shows that our long-term fix is holding up.
Apr 22, 12:15 CEST
Update -
We are implementing and testing a structural fix for the storage management software component.
Apr 22, 06:15 CEST
Update -
We are continuing to work on a fix for this issue.
Apr 22, 06:14 CEST
Update -
All VMs are back online. We are verifying the individual services now.
Apr 22, 06:13 CEST
Update -
There are still some VMs offline. We are working on bringing them back online.
Apr 22, 05:53 CEST
Update -
The storage server traffic is back to normal now. We are now looking at the affected VMs and services.
Apr 22, 05:04 CEST
Update -
We are still in the process of blocking the expensive metadata calls.
Apr 22, 04:53 CEST
Identified -
An expensive metadata call appears to be the cause. We are about to block this specific call.
Apr 22, 04:28 CEST
Update -
We are continuing to investigate this issue.
Apr 22, 04:10 CEST
Investigating -
A single OSD ("Disk") is showing unusual data transfer behaviour, causing VMs that use this OSD to slow down significantly.
Apr 22, 03:59 CEST