Network Maintenance with Downtime

Scheduled Maintenance Report for Flying Circus

Completed

Our statistics show that performance has been back to normal since around 05:00 CEST.

A small amount of recovery traffic is still ongoing, without impact on production performance.

Posted Oct 14, 2017 - 08:07 CEST

Update

We have restarted all Ceph daemons as a measure to counter potential issues resulting from changing the network settings "on the fly". We currently see a lot of recovery traffic in the cluster, which is causing slow requests, but the situation is improving. We'll keep an eye on it for a little longer.

Posted Oct 14, 2017 - 02:31 CEST

Verifying

The number of slow requests has been reduced, but performance is still degraded.

We're seeing that all customer applications are generally functioning, but with reduced performance depending on how much each application relies on raw disk performance.

We're expecting this to improve over the next few hours as the recovery in the storage cluster progresses.

Posted Oct 14, 2017 - 02:02 CEST

Update

We're still seeing stuck requests and are analyzing the situation.

Posted Oct 14, 2017 - 00:48 CEST

Update

We've finished configuring the jumbo frames and are currently cleaning up expected stuck requests in the Ceph cluster. We'll update here once done.

Posted Oct 14, 2017 - 00:31 CEST

Update

Maintenance is progressing, albeit a bit more slowly than anticipated. We're entering the last phase, in which we'll activate jumbo frames in the next few minutes, and expect to be finished by 00:30 CEST.

Posted Oct 13, 2017 - 23:51 CEST

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Oct 13, 2017 - 22:00 CEST

Scheduled

We are going to further improve our network configuration for enhanced reliability and performance. This will affect all networks and, in particular, will increase the responsiveness of our storage cluster during periods of recovery.

The network changes include some settings (like "Ethernet Flow Control" and "Jumbo Frames") that will cause intermittent connectivity issues on all VLANs. We expect multiple short interruptions of around 10 seconds each as individual servers apply their settings, and one longer 15-minute interruption when switching the storage network to jumbo frames.

Posted Oct 01, 2017 - 10:13 CEST