Storage performance degradation

Incident Report for Flying Circus

Resolved

The situation has resolved. The overall issue was a "thundering herd" effect of many (NixOS) machines unexpectedly pulling updates from our build server.

Earlier today we updated our build server software. Unexpectedly this resulted in an instantaneous rebuild of our NixOS platform (not a real release: nothing actually changed but some files that included the build version number were replaced). This resulted in many machines pulling those updates at the same time and starting to update those files on their disks. This resulted in an overload situation of the storage cluster that lasted from around 16:45 to around 17:20 CET.

We're sorry for the inconvenience. As it happens we discussed general improvements for our management utilities earlier today that will reduce the impact of those "thundering herd" moments.

Posted Dec 06, 2017 - 17:36 CET

Investigating

We're currently seeing higher response times from customer services and high IO latency in our storage cluster. We're investigating the situation.

Posted Dec 06, 2017 - 16:55 CET