Partial storage server outage

Incident Report for Flying Circus

Resolved

The issue has been resolved. We identified a number of storage servers that showed stuck requests after being rebooted during scheduled maintenance. Restarting the affected software components helped recover the situation.

A first review of the issue shows no specific error but suggests a (rare) race condition within Ceph. We will review the issue in depth at a later time.

Posted May 14, 2025 - 03:11 CEST

Investigating

We are seeing slow requests handled by our storage cluster and are currently investigating the issue.

Posted May 14, 2025 - 02:27 CEST

This incident affected: RZOB (production) (VM servers, VM storage cluster).