Storage performance analysis and kernel upgrade

Scheduled Maintenance Report for Flying Circus

Completed

The servers have been updated, and we ran some experiments with our configuration to improve cluster behaviour when servers reboot under load. We have seen improvements that should keep service checks from timing out (i.e. less than 60 seconds of slow requests). We will review our experimental results over the next few days and adjust the configuration further in the future.

Posted Jun 29, 2017 - 03:41 CEST
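
As an illustration of the 60-second criterion mentioned in this update, the following is a minimal sketch of how one could check a series of latency measurements against such a budget. It is not our actual monitoring code: the sample format (timestamp, latency pairs, both in seconds), the function names and the 1-second "slow" threshold are assumptions made for the example.

```python
from typing import Iterable, Tuple

# Budget taken from the update above: less than 60 seconds of slow requests.
SLOW_REQUEST_BUDGET_SECONDS = 60.0


def longest_slow_streak(
    samples: Iterable[Tuple[float, float]], slow_threshold: float
) -> float:
    """Return the length in seconds of the longest run of consecutive slow samples.

    `samples` are (timestamp, latency) pairs sorted by timestamp (assumed format).
    A run is measured from its first slow sample to its last slow sample,
    so a single isolated slow sample counts as 0 seconds.
    """
    longest = 0.0
    streak_start = None
    for timestamp, latency in samples:
        if latency >= slow_threshold:
            if streak_start is None:
                streak_start = timestamp
            longest = max(longest, timestamp - streak_start)
        else:
            # A fast sample ends the current run.
            streak_start = None
    return longest


def within_budget(
    samples: Iterable[Tuple[float, float]],
    slow_threshold: float = 1.0,  # hypothetical threshold, not a measured value
    budget: float = SLOW_REQUEST_BUDGET_SECONDS,
) -> bool:
    """True if the longest run of slow requests stays below the budget."""
    return longest_slow_streak(samples, slow_threshold) < budget
```

With samples taken every 10 seconds, a reboot that causes 30 seconds of slow requests would pass such a check, while one that causes 90 seconds would not.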

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Jun 28, 2017 - 20:30 CEST

Scheduled

Tomorrow evening we are going to investigate a performance issue in our storage cluster that has been building up over the last few weeks.

In general, individual server outages and reboots are intended to be completely invisible with Ceph. At the moment this holds while one of our servers reboots or crashes. However, after the server rejoins the cluster we see increased latency and blocked IO for multiple minutes, which disrupts customer VM activity.

We have not seen this behaviour in our testing environments. Tomorrow we will take two actions: we will install a pending kernel update on all servers, and we will manually attend the system reboots and watch the cluster's behaviour closely. We will also conduct a few controlled experiments to try to tune the cluster's restart behaviour for less impact on customer VMs.

During this maintenance period we expect multiple slowdowns during which VM disk IO may stall for a few minutes. This may impact service quality.

Posted Jun 27, 2017 - 11:15 CEST