Reduced storage performance

Incident Report for Flying Circus

Resolved

The S3 pool is performing nominally again after the configuration reset. Recovery traffic is still ongoing, but with negligible impact on customer performance.

The SSD pool is also performing nominally again after removal of the misbehaving SSDs.

We consider the incident closed and will carry out additional debugging and problem resolution in the background over the coming days.

Posted Oct 09, 2020 - 12:42 CEST

Monitoring

We have identified a configuration change (intended to improve recovery performance) as the likely cause of the slow requests in the S3 pool. We have reverted that change and are now running recovery at a throttled speed.

The SSD pool seems to be affected only sporadically, by a small number of older NVMe SSDs that did not exhibit those issues previously. We are taking those SSDs out of the cluster for later analysis.

Posted Oct 09, 2020 - 10:09 CEST

Identified

We are currently experiencing reduced storage performance for both SSD VM disks and the S3 object API.

This started yesterday (2020-10-08, around 11:30), but it was not properly announced on our status page: we marked the subsystem as "reduced performance", but this did not create an incident as expected.

The biggest performance impact was visible until 17:00 yesterday. However, we are still seeing the cluster's recovery traffic (distributing data to the new servers) affecting the performance of the S3 API, as well as sporadic very high I/O latency on SSD VMs.

At the moment we believe HDD VMs are not affected.

To alleviate the problem for S3, we have disabled cluster recovery for now to see whether this helps.

We are currently investigating the SSD latency spikes, which seem to be correlated with sporadic warnings we see on the underlying NVMe SSDs.

Posted Oct 09, 2020 - 09:35 CEST

This incident affected: RZOB (production) (VM storage cluster).