Sorry for the terse communication up until now. Stabilising the situation required our full attention.
We have reached a temporarily stable state. Completing the fix will require another maintenance window, which we will start at 22:00 CEST tonight.
The trigger we found was an issue in the network stack affecting new TCP connections opened from a virtual machine to the storage servers. These connection attempts experienced extremely long timeouts, causing the affected VMs to have trouble accessing some or all of their disks.
The issue arose on multiple storage servers, affecting varying numbers of virtual machines, after a network configuration update was applied on those servers. This change was - after careful testing, including on a single production server - incorrectly classified as safe for "in service maintenance". We have stopped rolling out the update for now but will have to finish the change later tonight. We are taking precautions to limit the impact of the remaining changes, but they might cause another period of up to 15 minutes of storage slowness.
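The symptom described above - new TCP connections hanging for a very long time before completing or failing - can be probed from an affected VM with a small sketch like the following. The host and port are placeholders, not our actual storage endpoints; this is an illustrative check, not part of our tooling.

```python
import socket
import time

def measure_connect_time(host, port, timeout=5.0):
    """Time the TCP handshake to host:port.

    Returns the elapsed time in seconds on success, or None if the
    connection attempt times out or is refused. A healthy local
    storage network should complete the handshake in milliseconds;
    values near the timeout indicate the slow-connect symptom.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

# Example (placeholder endpoint): repeat the probe to spot outliers,
# since the issue affected *new* connections intermittently.
# for _ in range(10):
#     print(measure_connect_time("storage.example.internal", 3260))
```

Running such a probe in a loop makes the intermittent nature visible: most handshakes finish quickly, while the affected ones approach the timeout.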
Aug 16, 2017 - 20:23 CEST
The situation has stabilised now. We are still trying to identify the root cause.
Aug 16, 2017 - 20:06 CEST
There is some sort of networking issue.
Aug 16, 2017 - 19:57 CEST
We have not yet been able to identify the cause of the problem. Currently, some VMs become blocked on storage access for a while and then recover after some time.
Aug 16, 2017 - 19:23 CEST
For a yet unknown reason, some storage requests are very slow.