Network Maintenance with Downtime
Scheduled Maintenance Report for Flying Circus
Completed
Our statistics show that performance has been back to normal since around 05:00 CEST.

A small amount of recovery traffic is still ongoing, without impact on production performance.
Posted 10 days ago. Oct 14, 2017 - 08:07 CEST
Update
We have restarted all Ceph daemons to counter potential issues resulting from the network changes being applied "on the fly". We currently see a lot of recovery traffic in the cluster, which is causing slow requests, but the situation is improving. We'll keep an eye on it for a little longer.
Posted 10 days ago. Oct 14, 2017 - 02:31 CEST
Verifying
The number of slow requests has been reduced, but performance is still degraded.

We're seeing that all customer applications are generally functional but exhibit reduced performance, depending on how much each application relies on raw disk performance.

We're expecting this to improve over the next few hours as the recovery in the storage cluster progresses.
Posted 10 days ago. Oct 14, 2017 - 02:02 CEST
Update
We're still seeing stuck requests and are analyzing the situation.
Posted 11 days ago. Oct 14, 2017 - 00:48 CEST
Update
We've finished configuring the jumbo frames and are currently cleaning up expected stuck requests in the Ceph cluster. We'll update here once done.
Posted 11 days ago. Oct 14, 2017 - 00:31 CEST
Update
Maintenance is progressing, albeit a bit more slowly than anticipated. We're entering the last phase, in which we'll activate jumbo frames in the next few minutes, and expect to be finished by 00:30 CEST.
Posted 11 days ago. Oct 13, 2017 - 23:51 CEST
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted 11 days ago. Oct 13, 2017 - 22:00 CEST
Scheduled
We are going to further improve our network configuration for enhanced reliability and performance. This will affect all networks but specifically will increase the responsiveness of our storage cluster during periods of recovery.

The network changes include some settings (like "Ethernet Flow Control" and "Jumbo Frames") that will cause intermittent connectivity issues on all VLANs. We expect multiple short interruptions of around 10 seconds as individual servers apply their settings, and one longer 15-minute interruption when switching the storage network to jumbo frames.
Posted 23 days ago. Oct 01, 2017 - 10:13 CEST