Our statistics indicate that performance has been back to normal since around 05:00 CEST.
A small amount of recovery traffic is still ongoing, without impact on production performance.
Posted 10 months ago. Oct 14, 2017 - 08:07 CEST
We have restarted all Ceph daemons as a measure to counter potential issues resulting from the network settings being changed "on the fly". We currently see a lot of recovery traffic in the cluster, which is causing slow requests, but the situation is improving. We'll keep an eye on it for a little longer.
Posted 10 months ago. Oct 14, 2017 - 02:31 CEST
The number of slow requests has been reduced, but performance is still degraded.
We're seeing that all customer applications are generally functional but exhibit reduced performance, depending on how heavily each application relies on raw disk performance.
We're expecting this to improve over the next few hours as the recovery in the storage cluster progresses.
Posted 10 months ago. Oct 14, 2017 - 02:02 CEST
We're still seeing stuck requests and are analyzing the situation.
Posted 10 months ago. Oct 14, 2017 - 00:48 CEST
We've finished configuring the jumbo frames and are currently cleaning up expected stuck requests in the Ceph cluster. We'll update here once done.
Posted 10 months ago. Oct 14, 2017 - 00:31 CEST
Maintenance is progressing, albeit a bit more slowly than anticipated. We're entering the last phase, in which we'll activate jumbo frames within the next few minutes, and expect to be finished by 00:30 CEST.
Posted 10 months ago. Oct 13, 2017 - 23:51 CEST
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted 10 months ago. Oct 13, 2017 - 22:00 CEST
We are going to further improve our network configuration for enhanced reliability and performance. This will affect all networks and, in particular, will increase the responsiveness of our storage cluster during periods of recovery.
The network changes include some settings (like "Ethernet Flow Control" and "Jumbo Frames") that will cause intermittent connectivity issues on all VLANs. We expect multiple short interruptions of around 10 seconds each as individual servers apply their settings, and one longer 15-minute interruption when switching the storage network to jumbo frames.