The situation has been resolved. The underlying issue was a "thundering herd" effect: many NixOS machines unexpectedly pulled updates from our build server at the same time.
Earlier today we updated our build server software. Unexpectedly, this triggered an immediate rebuild of our NixOS platform. This was not a real release: nothing actually changed, except that some files containing the build version number were replaced. As a result, many machines pulled those updates at the same time and started rewriting those files on their disks, which overloaded the storage cluster from around 16:45 to around 17:20 CET.
We're sorry for the inconvenience. As it happens, earlier today we discussed general improvements to our management utilities that will reduce the impact of such "thundering herd" moments.
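For readers unfamiliar with the term, one common general technique for dampening a thundering herd is to have each machine wait a different, stable amount of time before pulling an update, so that requests are spread over a window instead of arriving at once. The sketch below is purely illustrative and does not describe the improvements we discussed; the function name `update_delay_seconds`, the one-hour window, and the example hostnames are all assumptions made for this example.

```python
import hashlib


def update_delay_seconds(hostname: str, window_seconds: int = 3600) -> int:
    """Map a hostname to a stable delay in the range [0, window_seconds).

    Hypothetical helper: hashing the hostname gives every machine its own
    fixed offset, so a fleet spreads its update pulls across the window
    without needing any coordination between machines.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Example hostnames are made up for illustration.
    for host in ("vm-001", "vm-002", "vm-003"):
        print(host, update_delay_seconds(host))
```

A machine would sleep for its computed delay before contacting the build server. A hash-based offset keeps the delay reproducible per machine; drawing a random delay on each run is an equally common alternative.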
Posted Dec 06, 2017 - 17:36 CET
We're currently seeing higher response times from customer services and high IO latency in our storage cluster. We're investigating the situation.