Outages and performance issues

Incident Report for Flying Circus

Postmortem

We have completed an in-depth analysis of the storage performance issue from two weeks ago.

You can get the detailed post-mortem from our website: http://flyingcircus.io/postmortems/13271.pdf

Posted Mar 18, 2014 - 14:33 CET

Resolved

We have monitored overall performance and reliability since Friday and are happy to declare that the immediate issues are solved.

The new systems are not yet delivering the large performance improvement that we wanted and show mixed results:

* bulk read performance has improved: our backup jobs now finish around 9am instead of noon
* small writes appear to be slower than before, while bulk/sustained writes appear to be at least as fast as before

We have a few options on our list that we will try in order to improve performance. You will hear about those in the coming days.
Posted Mar 10, 2014 - 12:53 CET

Update

Here's an overview of the impact that the poor storage performance and reliability had over the last few days. The times are taken from periods in which our very sensitive IMAP server reported unavailability or extremely high response times, so they should give a rough upper estimate of our global performance issues.

The mail server experienced a total outage of about 3 hours over the last 7 days. Every other application in our Pingdom monitoring shows less than 50 minutes of downtime in the same period.

Tuesday, 2014-03-04

* 16:16-16:39
* 18:01-18:03

Wednesday, 2014-03-05

* 13:23-13:26
* 16:07-16:08
* 17:20-17:21

Thursday, 2014-03-06

* 08:28-08:33
* 08:50-09:03
* 09:20-09:43
* 10:00-10:08
* 10:22-10:45
* 12:07-12:20
* 13:21-14:18
* 16:06-17:50
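
As a rough cross-check, the windows listed above can be added up per day. The following is only a minimal sketch: the names `WINDOWS`, `window_minutes` and `minutes_per_day` are made up for illustration and are not part of our monitoring tooling. Since the windows also cover periods of extremely high response times, their sum is an upper estimate and comes out higher than the outage total quoted above.

```python
from datetime import datetime

# Windows copied from the lists above, grouped by date (hypothetical structure).
WINDOWS = {
    "2014-03-04": ["16:16-16:39", "18:01-18:03"],
    "2014-03-05": ["13:23-13:26", "16:07-16:08", "17:20-17:21"],
    "2014-03-06": [
        "08:28-08:33", "08:50-09:03", "09:20-09:43", "10:00-10:08",
        "10:22-10:45", "12:07-12:20", "13:21-14:18", "16:06-17:50",
    ],
}


def window_minutes(window):
    """Return the length of one 'HH:MM-HH:MM' window in whole minutes."""
    start_text, end_text = window.split("-")
    start = datetime.strptime(start_text, "%H:%M")
    end = datetime.strptime(end_text, "%H:%M")
    return int((end - start).total_seconds() // 60)


def minutes_per_day(windows):
    """Sum the affected minutes for each date."""
    return {
        day: sum(window_minutes(w) for w in spans)
        for day, spans in windows.items()
    }


per_day = minutes_per_day(WINDOWS)
total = sum(per_day.values())

for day, minutes in per_day.items():
    print(day, minutes, "min")
print("total", total, "min")
```

With the windows as listed, this gives 25, 5 and 246 minutes for the three days, or 276 minutes in total.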

We're extremely sorry for the inconvenience. The overall status appears to be back to normal for now. We will follow up in the coming days with more information on how we will change our processes to avoid a similar outage in the future.
Posted Mar 07, 2014 - 15:37 CET

Update

After reviewing last night's storage cluster recovery, we see that customer services (VM applications, mail, disk IO, interactive logins) appear to be back to normal.

However, the performance improvement that we wanted to achieve with the new hardware has not materialized. We are therefore still monitoring the cluster and are talking to our server vendor about options for optimization.

Looking at the various effects of the last few days, customer applications experienced the following symptoms:

* timeouts from applications or databases due to slow disk reads and writes
* mails not being sent (via SMTP) or not being accessible (via POP and IMAP) due to timeouts from the mail server
* high iowait values on VMs, even with little or only moderate IO
* low performance for small writes (reasonable write performance for large operations)
* high latency when logged in to a VM via SSH
Posted Mar 07, 2014 - 15:26 CET

Monitoring

Over the last few days, our systems experienced multiple short and medium outages (between 1 and 20 minutes) while we were installing new storage hardware.
Posted Mar 07, 2014 - 15:22 CET