Storage slowness
Incident Report for Flying Circus
Resolved
This incident has been resolved.
Posted 2 months ago. Aug 17, 2017 - 07:16 CEST
Update
Sorry for the terse communication up until now. Stabilising the situation required our full attention.

We have reached a temporarily stable state. Completing the change will require another maintenance window, which we will start at 22:00 CEST tonight.

The trigger we found was a problem in the network stack that affected new TCP connections opened from the virtual machines to the storage servers. Those connection attempts ran into extremely long timeouts, which caused the VMs to have trouble accessing some or all of their disks.

The issue arose on multiple storage servers, affecting varying numbers of virtual machines, after an update to the network configuration was applied on the storage servers. After careful testing, including on a single production server, this change was incorrectly classified as suitable for "in service maintenance". We have stopped applying the update for now but will have to finish the change later tonight. We are taking precautions to limit the impact of the remaining changes, but they might cause another period of up to 15 minutes of storage slowness.
Posted 2 months ago. Aug 16, 2017 - 20:23 CEST
Monitoring
The situation has stabilised now. We are still trying to identify the root cause.
Posted 2 months ago. Aug 16, 2017 - 20:06 CEST
Update
There is some sort of networking issue.
Posted 2 months ago. Aug 16, 2017 - 19:57 CEST
Update
We have not yet identified the cause of the problem. Currently, some VMs are blocked on storage access for a while and then recover.
Posted 2 months ago. Aug 16, 2017 - 19:23 CEST
Investigating
For a reason that is not yet known, some storage requests are very slow.
Posted 2 months ago. Aug 16, 2017 - 18:54 CEST