The storage cluster is currently not processing requests

Incident Report for Flying Circus

Resolved

The storage cluster has recovered successfully, and the applications that depend on it are working again.

It appears that a regular reboot of one of the storage servers caused a network connection overload, which triggered a cascade of other storage servers shutting down their processes. A clean restart of the processes has stabilized the situation.

All your data is safe. All VMs kept running throughout the incident and resumed operations when the cluster became operational again.

We will investigate the specific trigger of the cascade next week.

Posted Jun 09, 2017 - 16:00 CEST

Monitoring

We saw a large number of crashed storage server processes. We have restarted them and are seeing recovery. Further analysis will follow once we have ensured that all services are back to normal. Performance is currently somewhat reduced due to recovery traffic.

Posted Jun 09, 2017 - 15:30 CEST

Update

For a reason not yet known, several storage daemons stopped working at the same time. We are restarting them now.

Posted Jun 09, 2017 - 15:19 CEST

Investigating

We are currently investigating this issue.

Posted Jun 09, 2017 - 15:13 CEST