Flying Circus
All Systems Operational
VM servers   Operational
VM storage cluster   Operational
Network and Internet uplink   Operational
Central services   Operational
Related external services Operational
Bitbucket Git via HTTPS   Operational
Bitbucket Mercurial via HTTPS   Operational
Bitbucket SSH   Operational
GitHub   Operational
pypi.python.org   Operational
Fastly Europe (FRA)   Operational
Fastly Europe (AMS)   Operational
Past Incidents
Oct 24, 2017

No incidents reported today.

Oct 23, 2017

No incidents reported.

Oct 22, 2017

No incidents reported.

Oct 21, 2017

No incidents reported.

Oct 20, 2017

No incidents reported.

Oct 19, 2017

No incidents reported.

Oct 18, 2017

No incidents reported.

Oct 17, 2017

No incidents reported.

Oct 16, 2017

No incidents reported.

Oct 15, 2017
Completed - We have finished updating all storage servers with new Linux kernels and Ceph updates as planned. We have seen intermittent reduced performance as expected.

The cluster is currently finishing up some additional replication tasks that aren't affecting performance.
Oct 15, 13:32 CEST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 15, 08:00 CEST
Scheduled - We are going to upgrade our Ceph cluster to the newest stable version within our release branch (Hammer) and will also upgrade to the latest kernel release in our long-term release series (4.9). This fixes a few known issues in our underlying filesystem (XFS) and prepares our cluster for a larger Ceph version update in Q4 (Jewel).

This requires a reboot of all storage servers, which will be applied in a slow and staggered fashion to reduce potential impact. After our previous improvements, including our network optimizations, you might not see any impact from the recoveries. Nevertheless, as we have not yet seen a fully silent maintenance in production, we're announcing this maintenance with the expectation of up to 1 minute of reduced storage performance for each of our 10 servers.
Oct 1, 10:20 CEST
Oct 14, 2017
Completed - Our statistics show that performance has been back to normal since around 05:00 CEST.

There's a little remaining recovery traffic ongoing without impact on production performance.
Oct 14, 08:07 CEST
Update - We have restarted all Ceph daemons as a measure to counter potential issues resulting from the network changes being applied "on the fly". We currently see a lot of recovery traffic in the cluster, which is causing slow requests, but the situation is improving. We'll keep an eye on it for a little longer.
Oct 14, 02:31 CEST
Verifying - The number of slow requests has been reduced, but performance is still degraded.

We're seeing that all customer applications are generally functioning but exhibit reduced performance, depending on how heavily each application relies on raw disk performance.

We're expecting this to improve over the next few hours as the recovery in the storage cluster progresses.
Oct 14, 02:02 CEST
Update - We're still seeing stuck requests and are analyzing the situation.
Oct 14, 00:48 CEST
Update - We've finished configuring the jumbo frames and are currently cleaning up expected stuck requests in the Ceph cluster. We'll update here once done.
Oct 14, 00:31 CEST
Update - Maintenance is progressing, albeit a bit more slowly than anticipated. We're entering the last phase, in which we'll activate jumbo frames within the next few minutes, and expect to be finished by 00:30 CEST.
Oct 13, 23:51 CEST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 13, 22:00 CEST
Scheduled - We are going to further improve our network configuration for enhanced reliability and performance. This will affect all networks and, in particular, will increase the responsiveness of our storage cluster during periods of recovery.

The network changes include some settings (like "Ethernet Flow Control" and "Jumbo Frames") that will cause intermittent connectivity issues on all VLANs. We expect multiple short interruptions of around 10 seconds each as individual servers apply their settings, and one longer 15-minute interruption when switching the storage network to jumbo frames.
Oct 1, 10:13 CEST
Oct 12, 2017
Completed - The scheduled maintenance has been completed.
Oct 12, 00:00 CEST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 11, 21:00 CEST
Scheduled - Release 2017_023 is ready and will be rolled out during the specified timeframe.

See http://flyingcircus.io/doc/reference/changes/2017/r023.html for information about the specific changes.
Oct 10, 16:09 CEST
Oct 10, 2017

No incidents reported.