The cluster has recovered successfully and applications are running again.
It appears that a regular reboot of one of the storage servers has caused a network connection overload that triggered a cascade of other storage servers shutting down their processes. A clean restart of the processes has stabilized the situation.
All your data is safe. All VMs kept running throughout and resumed operations when the cluster became operational again.
We will investigate the specific trigger of the cascade next week.
Posted 5 months ago. Jun 09, 2017 - 16:00 CEST
We saw a lot of crashed storage server processes. We have restarted them and see recovery. Further analysis will follow after we ensure that all services are back to normal. Performance is currently a bit reduced due to recovery traffic.
Posted 5 months ago. Jun 09, 2017 - 15:30 CEST
For a reason not yet known, several storage daemons stopped working at the same time. We are restarting them now.