Storage improvement: additional SSD caching layer

Scheduled Maintenance Report for Flying Circus

Completed

The scheduled maintenance has been completed.

Posted Mar 06, 2016 - 21:00 CET

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Feb 29, 2016 - 21:00 CET

Scheduled

We are implementing a measure against one of the effects we saw during the recent outage: during periods of high activity, applications became starved for IO and thus unresponsive. Our evaluation shows that adding a second, generously sized SSD-based read/write caching layer to all storage servers helps here. In our tests this improved overall latency and made behaviour more predictable under high load.

Implementing this requires both a reboot of each storage server (to upgrade the RAID controller firmware) and a partial redistribution of data between the existing disks. To reduce the risk and impact of interruptions, we will upgrade one server per day, starting after 21:00 CET each evening, from Monday 2016-02-29 through Sunday 2016-03-06.

We have configured our environment to reduce the impact of these operations, but applications may experience temporarily reduced disk throughput and increased IO latency, resulting in higher response times.

Posted Feb 24, 2016 - 14:58 CET