We stopped the running kernel builds and prevented machines from starting further builds. Additionally, we temporarily halted cluster-internal replication traffic.
The kernel builds put a heavy load on our storage on top of the regular traffic. We had not seen this interaction before.
As a result of the previous storage issues and today's outage, we are adjusting our release process substantially:
• Instead of releasing during the day, we will now release only during low-traffic times (i.e., after 9pm).
• We will refactor kernel building so that the number of parallel builds is limited, reducing the storage impact.
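A minimal sketch of one way to cap parallel builds, assuming builds can be dispatched from a single coordinating process. It is an illustration only, not the actual platform tooling: `run_builds`, `make_counting_build`, and the `max_parallel` value are hypothetical names and settings, and the sleep stands in for the I/O-heavy compile.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_builds(vm_names, build_fn, max_parallel=2):
    """Run build_fn once per VM, with at most max_parallel builds at a time."""
    # A pool with max_parallel workers caps how many builds run at once;
    # the remaining builds wait in the pool's internal queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as vm_names.
        return list(pool.map(build_fn, vm_names))


def make_counting_build(duration=0.02):
    """Hypothetical stand-in for a kernel build that records peak concurrency."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def build(vm_name):
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(duration)  # placeholder for the I/O-heavy compile
        with lock:
            state["running"] -= 1
        return vm_name

    return build, state


if __name__ == "__main__":
    build, state = make_counting_build()
    run_builds([f"vm{i}" for i in range(8)], build, max_parallel=2)
    print("peak parallel builds:", state["peak"])
```

The worker count is the single knob here: lowering `max_parallel` trades a longer rollout for less simultaneous I/O on the storage cluster.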
Additionally, we will review our storage setup yet again, as we have seen varying load patterns across storage servers of different generations.
We are sorry for the repeated interruptions and understand that this impacts your business critically.
We will follow up with further details about our planned improvements in the coming days.
Posted Feb 16, 2016 - 15:49 CET
Investigating
We are currently experiencing a performance issue with our storage cluster. We are investigating.
Unfortunately, the platform release that is currently being rolled out involves compiling a new kernel for each VM. Since this causes a lot of I/O activity, we will try to stop the kernel builds for now.