Storage system error

Incident Report for Flying Circus

Resolved

The defective system could not be reactivated at short notice. We have begun ordering replacement parts. The second failed system appears to have stabilized, and the cluster has full N+2 redundancy again.

Posted Aug 27, 2020 - 16:28 CEST

Update

We have found that a second server experienced an intermittent controller failure at the same time and subsequently deactivated its storage processes. When two servers fail, a third copy still provides redundancy. However, Ceph will not perform IO on any affected object as long as only 1 copy of it is available. Automatic recovery normally clears this block by itself, but, as explained in the previous update, a few disks had filled up too far and thus stopped Ceph from creating new copies of some objects.

The second failed server has already recovered; the first server needs to be reactivated tomorrow with the help of on-site personnel.
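
The blocking rule described in this update can be sketched as follows. This is a minimal illustration, assuming a pool that keeps 3 copies of each object and requires at least 2 reachable copies before serving IO. These values and the function names are illustrative assumptions and are not taken from the cluster's actual configuration.

```python
# Assumed pool settings (not stated in this report):
# 3 copies per object, at least 2 must be reachable for client IO.
POOL_SIZE = 3
MIN_SIZE = 2


def io_allowed(available_copies: int, min_size: int = MIN_SIZE) -> bool:
    """Client IO proceeds only while enough copies are reachable."""
    return available_copies >= min_size


def needs_recovery(available_copies: int, size: int = POOL_SIZE) -> bool:
    """Recovery has to create new copies whenever some are missing."""
    return available_copies < size


# Walk through 0, 1 and 2 failed servers holding a copy of one object.
for failed_servers in range(3):
    copies = POOL_SIZE - failed_servers
    print(
        f"{failed_servers} failed: copies={copies}, "
        f"io_allowed={io_allowed(copies)}, "
        f"needs_recovery={needs_recovery(copies)}"
    )
```

With one failed server the object stays usable while recovery runs. With two failed servers the data is still present on the third copy, but IO waits until recovery has created a second copy.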

Posted Aug 26, 2020 - 22:31 CEST

Monitoring

When the storage server crashed, automatic recovery started as expected (that is: transparently, without outage). Due to unevenly balanced recovery, a smaller disk received more data than expected and started rejecting new data. In addition, the cluster determined that a number of objects had too few copies and blocked client operations on those objects. The combination of recovery being blocked and the cluster blocking operations led to a kind of deadlock until we adjusted the balancing and removed pressure from the smaller disk. We are currently seeing all VMs back to regular operations while backfill is still going on.

We are also currently working on recovering the crashed storage host to get the cluster back to full capacity.
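
The deadlock described in this update can be modelled roughly as follows. This is a sketch under stated assumptions: the 0.90 threshold is Ceph's documented default backfillfull ratio, while the disk sizes, fill levels, minimum copy count and all names are hypothetical and do not reflect the real cluster.

```python
from dataclasses import dataclass

# Assumption: Ceph's documented default backfillfull ratio.
# The cluster's actual setting is not stated in this report.
BACKFILLFULL_RATIO = 0.90


@dataclass
class Disk:
    name: str
    capacity_gb: float
    used_gb: float

    def accepts_backfill(self, ratio: float = BACKFILLFULL_RATIO) -> bool:
        """A disk filled beyond the ratio rejects new recovery data."""
        return self.used_gb / self.capacity_gb < ratio


def try_recover(copies: int, target: Disk, min_size: int = 2):
    """Attempt to create one new copy on target.

    Returns (copies, io_allowed) after the attempt.
    """
    if target.accepts_backfill():
        copies += 1
    return copies, copies >= min_size


# Hypothetical disks: a small, nearly full one and a large one with headroom.
small = Disk("small", capacity_gb=1000, used_gb=920)
large = Disk("large", capacity_gb=4000, used_gb=2000)

# Recovery aimed at the overfull disk is rejected, so IO stays blocked.
stuck = try_recover(copies=1, target=small)

# After rebalancing towards a disk with headroom, recovery succeeds
# and client IO is allowed again.
freed = try_recover(copies=1, target=large)

print("target small:", stuck)
print("target large:", freed)
```

In this model the object can only become usable again through recovery, and recovery can only proceed once its target disk has room, which is why relieving the smaller disk resolved the block.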

Posted Aug 26, 2020 - 21:47 CEST

Investigating

The failure of a single storage server should have been handled transparently through replication, which didn't happen. It seems that disks on other servers became too slow (overloaded?) at the same time.

Posted Aug 26, 2020 - 21:32 CEST

Identified

A storage server failed, causing considerably degraded performance.

Posted Aug 26, 2020 - 21:15 CEST

This incident affected: RZOB (production) (VM storage cluster).