An analysis of the relevant Ceph configuration options showed that the margin between the threshold at which ongoing backfills are suspended and the threshold at which OSDs stop servicing production traffic was too small. We are adjusting these parameters to avoid such a condition in the future.
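For reference, the options involved look roughly like this, shown here with their stock defaults (a sketch only; the exact values we deploy may differ):

    [osd]
    # suspend backfills into an OSD above this utilization (default 0.85)
    osd backfill full ratio = 0.85

    [mon]
    # warn once any OSD exceeds this utilization (default 0.85)
    mon osd nearfull ratio = 0.85
    # stop servicing I/O once any OSD exceeds this utilization (default 0.95)
    mon osd full ratio = 0.95

Widening the gap between the backfill threshold and the full threshold, for example by lowering "osd backfill full ratio", leaves more headroom for regular client writes while a rebalance is in flight.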
Posted Dec 30, 2015 - 09:04 CET
The reorganization caused a long downtime tonight (2:30 CET - 7:15 CET) which initially went unnoticed. We fixed the issue a few minutes ago and VMs are resuming operation at the moment.
The issue was caused by a Ceph storage daemon (OSD) that ran full due to imbalanced data distribution. The cluster consequently stopped I/O to avoid data loss.
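For those following along, this condition is visible with the standard Ceph tooling (illustrative commands, not a transcript from the affected cluster):

    ceph health detail   # reports "full osd(s)" once the full ratio is exceeded
    ceph osd df          # per-OSD utilization, which makes the imbalance visible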
Unfortunately, our alerting did not notify us of the problem, and we only became aware of it during a routine check this morning. Our distributed alerting components seem to have a previously unknown dependency on the health of the Ceph cluster in this location; we are investigating this.
Posted Dec 30, 2015 - 07:19 CET
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Dec 28, 2015 - 15:00 CET
After the hardware upgrade, we will reorganize file systems on our storage servers. This will improve both performance and operability of the storage cluster. It is a proactive measure to prepare the storage cluster for future growth.
The cluster reorganization will be an ongoing process spanning several days, as we have to move large amounts of data. VMs will run continuously as always, but storage performance may become spiky at times, and short I/O hangs may occur occasionally. We have scheduled this maintenance during the Christmas holiday season because we expect reduced overall load this week.