Storage slowness

Incident Report for Flying Circus

Resolved

This incident has been resolved.

Posted Aug 17, 2017 - 07:16 CEST

Update

Sorry for the terse communication up until now. Stabilising the situation required our full attention.

We have reached a temporarily stable state. Completing the fix requires another maintenance window, which we will start at 22:00 CEST tonight.

The trigger we found was a problem in the network stack that affected new TCP connections opened from the virtual machines to the storage servers. Those connection attempts ran into extremely long timeouts, which caused the VMs to have trouble accessing some or all of their disks.
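For readers who want to check for this kind of symptom themselves, the following is a minimal sketch of timing a new TCP connection. It is not the tooling we used during the incident. The function name `tcp_connect_seconds` and the host and port in the usage note are hypothetical placeholders.

```python
import socket
import time
from typing import Optional


def tcp_connect_seconds(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return how long opening a new TCP connection takes, in seconds.

    Returns None if the connection is refused, unreachable, or times out.
    """
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake; the socket is closed
        # again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # socket.timeout and ConnectionRefusedError are both OSError subclasses.
        return None
```

Calling, for example, `tcp_connect_seconds("storage.example.internal", 12345)` from a VM would normally return a value in the low milliseconds on a healthy local network. Repeated `None` results or values close to the timeout would point to the kind of connection setup problem described above.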

The issue arose on multiple storage servers, affecting varying numbers of virtual machines, after an update to the network configuration was applied on the storage servers. After careful testing, including on a single production server, this change was incorrectly classified as suitable for "in service maintenance". We have stopped rolling out the update for now but will have to finish the change later tonight. We are taking precautions to limit the impact of the remaining changes, but they might cause another period of up to 15 minutes of storage slowness.

Posted Aug 16, 2017 - 20:23 CEST

Monitoring

The situation has stabilised now. We are still trying to identify the root cause.

Posted Aug 16, 2017 - 20:06 CEST

Update

There is some sort of networking issue.

Posted Aug 16, 2017 - 19:57 CEST

Update

We have not yet been able to identify the cause of the problem. Currently, some VMs block on storage access for a while and then recover.

Posted Aug 16, 2017 - 19:23 CEST

Investigating

For a reason not yet known, some storage requests are very slow.

Posted Aug 16, 2017 - 18:54 CEST